Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.
The well-cited National Longitudinal Study of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year records have been gathered through surveys. The survey data is open to the public. From the many variables available, we assembled a subset including personal demographic variables in different years, household environment in 79, ASVAB test scores in 81, and Self-Esteem scores in 81 and 87.
The data is stored in NLSY79.csv.
Here are the descriptions of the variables:
Personal Demographic Variables
Household Environment
Variables Related to ASVAB test Scores in 1981
| Test | Description |
|---|---|
| AFQT | percentile score on the AFQT intelligence test in 1981 |
| Coding | score on the Coding Speed test in 1981 |
| Auto | score on the Automotive and Shop test in 1981 |
| Mechanic | score on the Mechanic test in 1981 |
| Elec | score on the Electronics Information test in 1981 |
| Science | score on the General Science test in 1981 |
| Math | score on the Math test in 1981 |
| Arith | score on the Arithmetic Reasoning test in 1981 |
| Word | score on the Word Knowledge Test in 1981 |
| Parag | score on the Paragraph Comprehension test in 1981 |
| Number | score on the Numerical Operations test in 1981 |
Self-Esteem test 81 and 87
We have two sets of self-esteem tests, one in 1981 and the other in
1987. Each set has the same 10 questions. They are labeled
Esteem81 and Esteem87 respectively, followed by
the question number. For example, Esteem81_1 is Esteem
question 1 in 81.
The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree
Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?
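A minimal sketch of this first pass is shown below. It assumes the file would normally be read with `read.csv("NLSY79.csv")`; to stay self-contained it uses a small hypothetical data frame instead, with values chosen to mimic the negative minimums that appear in the summary for Income87 and HeightFeet05.

```r
# Hypothetical stand-in for the real file; in practice one might use
# data <- read.csv("NLSY79.csv")
data <- data.frame(
  Subject      = c(2, 6, 7, 8),
  Income87     = c(16000, 18000, -2, 9000),
  HeightFeet05 = c(5, 5, -4, 5),
  Esteem87_1   = c(2, 1, 2, 1)
)

str(data)      # variable types and the first few values
summary(data)  # ranges help reveal implausible values

# TRUE if any cell in the data frame is NA
any_missing <- anyNA(data)

# Count negative entries in columns that should be non-negative
n_negative <- colSums(data[, c("Income87", "HeightFeet05")] < 0)
```

Negative incomes or heights are not missing values in the `NA` sense, so a check like `n_negative` is worth running alongside `anyNA()`.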
## 'data.frame': 2431 obs. of 46 variables:
## $ Subject : int 2 6 7 8 9 13 16 17 18 20 ...
## $ Gender : chr "female" "male" "male" "female" ...
## $ Education05 : int 12 16 12 14 14 16 13 13 13 17 ...
## $ Income87 : int 16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
## $ Job05 : chr "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
## $ Income05 : int 5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
## $ Weight05 : int 160 187 175 246 180 235 160 188 173 130 ...
## $ HeightFeet05 : int 5 5 5 5 5 6 5 5 5 5 ...
## $ HeightInch05 : int 2 5 9 3 6 0 4 10 9 4 ...
## $ Imagazine : int 1 0 1 1 1 1 1 1 1 1 ...
## $ Inewspaper : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ilibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MotherEd : int 5 12 12 9 12 12 12 12 12 12 ...
## $ FatherEd : int 8 12 12 6 10 16 12 15 16 18 ...
## $ FamilyIncome78: int 20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
## $ Science : int 6 23 14 18 17 16 13 19 22 21 ...
## $ Arith : int 8 30 14 13 21 30 17 29 30 17 ...
## $ Word : int 15 35 27 35 28 29 30 33 35 28 ...
## $ Parag : int 6 15 8 12 10 13 12 13 14 14 ...
## $ Number : int 29 45 32 24 40 36 49 35 48 39 ...
## $ Coding : int 52 68 35 48 46 30 58 58 61 54 ...
## $ Auto : int 9 21 13 11 13 21 11 18 21 18 ...
## $ Math : int 6 23 11 4 13 24 17 21 23 20 ...
## $ Mechanic : int 10 21 9 12 13 19 11 19 16 20 ...
## $ Elec : int 5 19 11 12 15 16 10 16 17 13 ...
## $ AFQT : num 6.84 99.39 47.41 44.02 59.68 ...
## $ Esteem81_1 : int 1 2 2 1 1 1 2 2 2 1 ...
## $ Esteem81_2 : int 1 1 1 1 1 1 2 2 2 1 ...
## $ Esteem81_3 : int 4 4 3 3 4 4 3 3 3 3 ...
## $ Esteem81_4 : int 1 2 2 2 1 1 2 2 2 1 ...
## $ Esteem81_5 : int 3 4 3 3 1 4 3 3 3 3 ...
## $ Esteem81_6 : int 3 2 2 2 1 1 2 2 2 2 ...
## $ Esteem81_7 : int 1 2 2 3 1 1 3 2 2 1 ...
## $ Esteem81_8 : int 3 4 2 3 4 4 3 3 3 3 ...
## $ Esteem81_9 : int 3 3 3 3 4 4 3 3 3 3 ...
## $ Esteem81_10 : int 3 4 3 3 4 4 3 3 3 3 ...
## $ Esteem87_1 : int 2 1 2 1 1 1 1 2 1 1 ...
## $ Esteem87_2 : int 1 1 2 1 1 1 1 2 1 1 ...
## $ Esteem87_3 : int 4 4 4 3 4 4 4 3 4 4 ...
## $ Esteem87_4 : int 1 1 2 1 1 1 2 2 1 4 ...
## $ Esteem87_5 : int 2 4 4 4 4 4 4 3 4 4 ...
## $ Esteem87_6 : int 2 1 2 2 1 1 2 2 1 1 ...
## $ Esteem87_7 : int 2 2 2 1 1 2 2 2 2 1 ...
## $ Esteem87_8 : int 3 3 4 2 4 4 4 3 4 3 ...
## $ Esteem87_9 : int 3 2 3 2 4 4 3 3 3 4 ...
## $ Esteem87_10 : int 4 4 4 2 4 4 4 3 4 4 ...
## [1] FALSE
## Subject Gender Education05 Income87
## Min. : 2 Length:2431 Min. : 6.0 Min. : -2
## 1st Qu.: 1592 Class :character 1st Qu.:12.0 1st Qu.: 4500
## Median : 3137 Mode :character Median :13.0 Median :12000
## Mean : 3504 Mean :13.9 Mean :13399
## 3rd Qu.: 4668 3rd Qu.:16.0 3rd Qu.:19000
## Max. :12140 Max. :20.0 Max. :59387
## Job05 Income05 Weight05 HeightFeet05
## Length:2431 Min. : 63 Min. : 81 Min. :-4.00
## Class :character 1st Qu.: 22650 1st Qu.:150 1st Qu.: 5.00
## Mode :character Median : 38500 Median :180 Median : 5.00
## Mean : 49415 Mean :183 Mean : 5.18
## 3rd Qu.: 61350 3rd Qu.:209 3rd Qu.: 5.00
## Max. :703637 Max. :380 Max. : 8.00
## HeightInch05 Imagazine Inewspaper Ilibrary MotherEd
## Min. : 0.00 Min. :0.000 Min. :0.000 Min. :0.00 Min. : 0.0
## 1st Qu.: 2.00 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:11.0
## Median : 5.00 Median :1.000 Median :1.000 Median :1.00 Median :12.0
## Mean : 5.32 Mean :0.718 Mean :0.861 Mean :0.77 Mean :11.7
## 3rd Qu.: 8.00 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:12.0
## Max. :11.00 Max. :1.000 Max. :1.000 Max. :1.00 Max. :20.0
## FatherEd FamilyIncome78 Science Arith Word
## Min. : 0.0 Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:10.0 1st Qu.:11167 1st Qu.:13.0 1st Qu.:13.0 1st Qu.:23.0
## Median :12.0 Median :20000 Median :17.0 Median :19.0 Median :28.0
## Mean :11.9 Mean :21252 Mean :16.3 Mean :18.6 Mean :26.6
## 3rd Qu.:14.0 3rd Qu.:27500 3rd Qu.:20.0 3rd Qu.:25.0 3rd Qu.:32.0
## Max. :20.0 Max. :75001 Max. :25.0 Max. :30.0 Max. :35.0
## Parag Number Coding Auto Math
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:10.0 1st Qu.:29.0 1st Qu.:38.0 1st Qu.:10.0 1st Qu.: 9.0
## Median :12.0 Median :36.0 Median :48.0 Median :14.0 Median :14.0
## Mean :11.2 Mean :35.5 Mean :47.1 Mean :14.3 Mean :14.3
## 3rd Qu.:14.0 3rd Qu.:44.0 3rd Qu.:57.0 3rd Qu.:18.0 3rd Qu.:20.0
## Max. :15.0 Max. :50.0 Max. :84.0 Max. :25.0 Max. :25.0
## Mechanic Elec AFQT Esteem81_1 Esteem81_2
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :1.00 Min. :1.00
## 1st Qu.:11.0 1st Qu.: 9.0 1st Qu.: 31.9 1st Qu.:1.00 1st Qu.:1.00
## Median :14.0 Median :12.0 Median : 57.0 Median :1.00 Median :1.00
## Mean :14.4 Mean :11.6 Mean : 54.7 Mean :1.42 Mean :1.42
## 3rd Qu.:18.0 3rd Qu.:15.0 3rd Qu.: 78.2 3rd Qu.:2.00 3rd Qu.:2.00
## Max. :25.0 Max. :20.0 Max. :100.0 Max. :4.00 Max. :4.00
## Esteem81_3 Esteem81_4 Esteem81_5 Esteem81_6 Esteem81_7
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:3.00 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:1.00 1st Qu.:1.00
## Median :4.00 Median :2.00 Median :4.00 Median :2.00 Median :2.00
## Mean :3.51 Mean :1.57 Mean :3.46 Mean :1.62 Mean :1.75
## 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.:2.00
## Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.00
## Esteem81_8 Esteem81_9 Esteem81_10 Esteem87_1 Esteem87_2
## Min. :1.00 Min. :1.00 Min. :1.0 Min. :1.00 Min. :1.0
## 1st Qu.:3.00 1st Qu.:3.00 1st Qu.:3.0 1st Qu.:1.00 1st Qu.:1.0
## Median :3.00 Median :3.00 Median :3.0 Median :1.00 Median :1.0
## Mean :3.13 Mean :3.16 Mean :3.4 Mean :1.38 Mean :1.4
## 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.0 3rd Qu.:2.00 3rd Qu.:2.0
## Max. :4.00 Max. :4.00 Max. :4.0 Max. :4.00 Max. :4.0
## Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6 Esteem87_7
## Min. :1.00 Min. :1.0 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:3.00 1st Qu.:1.0 1st Qu.:3.00 1st Qu.:1.00 1st Qu.:1.00
## Median :4.00 Median :1.0 Median :4.00 Median :2.00 Median :2.00
## Mean :3.58 Mean :1.5 Mean :3.53 Mean :1.59 Mean :1.72
## 3rd Qu.:4.00 3rd Qu.:2.0 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.:2.00
## Max. :4.00 Max. :4.0 Max. :4.00 Max. :4.00 Max. :4.00
## Esteem87_8 Esteem87_9 Esteem87_10
## Min. :1.0 Min. :1.00 Min. :1.00
## 1st Qu.:3.0 1st Qu.:3.00 1st Qu.:3.00
## Median :3.0 Median :3.00 Median :3.00
## Mean :3.1 Mean :3.06 Mean :3.37
## 3rd Qu.:4.0 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :4.0 Max. :4.00 Max. :4.00
## [1] ""
## [2] "10 TO 430: Executive, Administrative and Managerial Occupations"
## [3] "1000 TO 1240: Mathematical and Computer Scientists"
## [4] "1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians"
## [5] "1600 TO 1760: Physical Scientists"
## [6] "1800 TO 1860: Social Scientists and Related Workers"
## [7] "1900 TO 1960: Life, Physical and Social Science Technicians"
## [8] "2000 TO 2060: Counselors, Sociala and Religious Workers"
## [9] "2100 TO 2150: Lawyers, Judges and Legal Support Workers"
## [10] "2200 TO 2340: Teachers"
## [11] "2400 TO 2550: Education, Training and Library Workers"
## [12] "2600 TO 2760: Entertainers and Performers, Sports and Related Workers"
## [13] "2800 TO 2960: Media and Communications Workers"
## [14] "3000 TO 3260: Health Diagnosing and Treating Practitioners"
## [15] "3300 TO 3650: Health Care Technical and Support Occupations"
## [16] "3700 TO 3950: Protective Service Occupations"
## [17] "4000 TO 4160: Food Preparation and Serving Related Occupations"
## [18] "4200 TO 4250: Cleaning and Building Service Occupations"
## [19] "4300 TO 4430: Entertainment Attendants and Related Workers"
## [20] "4500 TO 4650: Personal Care and Service Workers"
## [21] "4700 TO 4960: Sales and Related Workers"
## [22] "500 TO 950: Management Related Occupations"
## [23] "5000 TO 5930: Office and Administrative Support Workers"
## [24] "6000 TO 6130: Farming, Fishing and Forestry Occupations"
## [25] "6200 TO 6940: Construction Trade and Extraction Workers"
## [26] "7000 TO 7620: Installation, Maintenance and Repairs Workers"
## [27] "7700 TO 7750: Production and Operating Workers"
## [28] "7800 TO 7850: Food Preparation Occupations"
## [29] "7900 TO 8960: Setters, Operators and Tenders"
## [30] "9000 TO 9750: Transportation and Material Moving Workers"
## [31] "9990: Uncodeable"
##
##
## 56
## 10 TO 430: Executive, Administrative and Managerial Occupations
## 377
## 1000 TO 1240: Mathematical and Computer Scientists
## 64
## 1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians
## 53
## 1600 TO 1760: Physical Scientists
## 4
## 1800 TO 1860: Social Scientists and Related Workers
## 6
## 1900 TO 1960: Life, Physical and Social Science Technicians
## 7
## 2000 TO 2060: Counselors, Sociala and Religious Workers
## 41
## 2100 TO 2150: Lawyers, Judges and Legal Support Workers
## 15
## 2200 TO 2340: Teachers
## 120
## 2400 TO 2550: Education, Training and Library Workers
## 29
## 2600 TO 2760: Entertainers and Performers, Sports and Related Workers
## 24
## 2800 TO 2960: Media and Communications Workers
## 13
## 3000 TO 3260: Health Diagnosing and Treating Practitioners
## 74
## 3300 TO 3650: Health Care Technical and Support Occupations
## 99
## 3700 TO 3950: Protective Service Occupations
## 54
## 4000 TO 4160: Food Preparation and Serving Related Occupations
## 68
## 4200 TO 4250: Cleaning and Building Service Occupations
## 67
## 4300 TO 4430: Entertainment Attendants and Related Workers
## 10
## 4500 TO 4650: Personal Care and Service Workers
## 42
## 4700 TO 4960: Sales and Related Workers
## 205
## 500 TO 950: Management Related Occupations
## 108
## 5000 TO 5930: Office and Administrative Support Workers
## 360
## 6000 TO 6130: Farming, Fishing and Forestry Occupations
## 9
## 6200 TO 6940: Construction Trade and Extraction Workers
## 135
## 7000 TO 7620: Installation, Maintenance and Repairs Workers
## 108
## 7700 TO 7750: Production and Operating Workers
## 49
## 7800 TO 7850: Food Preparation Occupations
## 4
## 7900 TO 8960: Setters, Operators and Tenders
## 112
## 9000 TO 9750: Transportation and Material Moving Workers
## 117
## 9990: Uncodeable
## 1
Let's concentrate on the Esteem scores evaluated in 87.
Esteem variables.
Pay attention to missing values, any peculiar numbers, etc. How do you
fix the problems discovered, if there are any? Briefly describe what you have
done for the data preparation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 1.38 2.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 1.4 2.0 4.0
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.58 4.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 1.5 2.0 4.0
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 3.53 4.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 1.59 2.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 1.72 2.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 3.0 3.1 4.0 4.0
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.06 4.00 4.00
## [1] FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 3.00 3.37 4.00 4.00
## [1] FALSE
The first thing I did was create a summary of all the data to provide basic insights into the distribution of the Esteem87 scores. After this I checked for missing values (of which there are none) and examined the interpretation of the scores more carefully. From this, I understood that some questions were framed in such a way that higher scores indicated higher levels of self-esteem, while other questions were framed in such a way that lower scores indicated higher self-esteem. This needs to be standardised across all the questions to ensure easy comparison across the different questions in Esteem87.
If the esteem data is stored in data.esteem, you can use the code
data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)]
to invert the scores.
To fix this, I identified the questions which were framed in a positive way (Questions 1, 2, 4, 6 and 7), meaning that lower scores (“Strongly Agree”) indicated higher self-esteem. I inverted these scores, creating a standardised measure where higher scores across all questions indicate higher self-esteem.
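The reversal can be checked on a small example. The sketch below builds a hypothetical 3-subject `data.esteem` with random answers on the 1 to 4 scale, applies the same `5 - x` transformation to items 1, 2, 4, 6 and 7, and confirms that the other items are untouched.

```r
set.seed(1)

# Hypothetical toy data: 3 subjects, 10 items answered on a 1-4 scale
data.esteem <- as.data.frame(matrix(sample(1:4, 30, replace = TRUE), nrow = 3))
names(data.esteem) <- paste0("Esteem87_", 1:10)

before    <- data.esteem          # keep a copy for comparison
rev_items <- c(1, 2, 4, 6, 7)     # positively worded items

# 1 <-> 4 and 2 <-> 3, so that a higher score means higher self-esteem
data.esteem[, rev_items] <- 5 - data.esteem[, rev_items]

# Each reversed answer and its original must add up to 5
reversed_ok  <- all(before[, rev_items] + data.esteem[, rev_items] == 5)
untouched_ok <- all(before[, -rev_items] == data.esteem[, -rev_items])
```

Running the reversal twice restores the original answers, so it should be applied exactly once.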
After reversing, Esteem87_1 through Esteem87_5 are highly left-skewed, meaning that the vast majority of scores are 3 or 4, with means of 3.62, 3.6, 3.58, 3.5 and 3.53 respectively. The remaining items, Esteem87_6 through Esteem87_10, are still left-skewed but to a lesser extent, with means of 3.41, 3.28, 3.1, 3.06 and 3.37.
## Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1 1.000 0.704 0.448 0.528 0.399 0.464
## Esteem87_2 0.704 1.000 0.443 0.551 0.402 0.481
## Esteem87_3 0.448 0.443 1.000 0.408 0.549 0.410
## Esteem87_4 0.528 0.551 0.408 1.000 0.381 0.509
## Esteem87_5 0.399 0.402 0.549 0.381 1.000 0.405
## Esteem87_6 0.464 0.481 0.410 0.509 0.405 1.000
## Esteem87_7 0.379 0.410 0.343 0.422 0.370 0.600
## Esteem87_8 0.273 0.283 0.351 0.295 0.381 0.409
## Esteem87_9 0.236 0.259 0.349 0.287 0.354 0.364
## Esteem87_10 0.312 0.330 0.460 0.366 0.436 0.442
## Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1 0.379 0.273 0.236 0.312
## Esteem87_2 0.410 0.283 0.259 0.330
## Esteem87_3 0.343 0.351 0.349 0.460
## Esteem87_4 0.422 0.295 0.287 0.366
## Esteem87_5 0.370 0.381 0.354 0.436
## Esteem87_6 0.600 0.409 0.364 0.442
## Esteem87_7 1.000 0.389 0.352 0.390
## Esteem87_8 0.389 1.000 0.430 0.438
## Esteem87_9 0.352 0.430 1.000 0.579
## Esteem87_10 0.390 0.438 0.579 1.000
All of the scores are positively correlated. The weakest correlations are between Esteem87_1 & Esteem87_9 (0.236, the minimum) and Esteem87_1 & Esteem87_8 (0.273), and the strongest is between Esteem87_1 & Esteem87_2 (0.704).
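As a small illustration of how the weakest and strongest pairs can be located programmatically, the sketch below copies four rows and columns of the correlation matrix above (items 1, 2, 8 and 9) and searches the upper triangle. The variable names `off`, `weakest` and `strongest` are illustrative only.

```r
vars <- paste0("Esteem87_", c(1, 2, 8, 9))

# Subset of the correlation matrix reported above
R <- matrix(c(1.000, 0.704, 0.273, 0.236,
              0.704, 1.000, 0.283, 0.259,
              0.273, 0.283, 1.000, 0.430,
              0.236, 0.259, 0.430, 1.000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once and drop the diagonal
off <- R
off[!upper.tri(off)] <- NA

weakest   <- which(off == min(off, na.rm = TRUE), arr.ind = TRUE)
strongest <- which(off == max(off, na.rm = TRUE), arr.ind = TRUE)

weakest_pair   <- c(rownames(R)[weakest[1, "row"]],   colnames(R)[weakest[1, "col"]])
strongest_pair <- c(rownames(R)[strongest[1, "row"]], colnames(R)[strongest[1, "col"]])
```

With the full matrix, the same code would be applied to the output of `cor()` on the ten reversed items.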
PCA on 10 esteem measurements. (centered but no scaling)
## PC1 PC2
## Esteem87_1 0.324 -0.4452
## Esteem87_2 0.333 -0.4283
## Esteem87_3 0.322 0.0115
## Esteem87_4 0.324 -0.2877
## Esteem87_5 0.315 0.0793
## Esteem87_6 0.347 -0.0492
## Esteem87_7 0.315 0.0196
## Esteem87_8 0.280 0.3619
## Esteem87_9 0.277 0.4917
## Esteem87_10 0.318 0.3918
Yes, the PC1 and PC2 loading vectors are both unit vectors, and they are orthogonal to each other.
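This can be verified directly from the loadings printed above. The sketch below types them in by hand, so the results are only accurate up to the rounding of the printed values.

```r
# Loadings copied from the table above (rounded as printed)
pc1 <- c(0.324, 0.333, 0.322, 0.324, 0.315, 0.347, 0.315, 0.280, 0.277, 0.318)
pc2 <- c(-0.4452, -0.4283, 0.0115, -0.2877, 0.0793,
         -0.0492, 0.0196, 0.3619, 0.4917, 0.3918)

len1 <- sqrt(sum(pc1^2))   # length of the PC1 loading vector, close to 1
len2 <- sqrt(sum(pc2^2))   # length of the PC2 loading vector, close to 1
dot  <- sum(pc1 * pc2)     # inner product, close to 0 if orthogonal
```

With an unrounded `prcomp` object, `crossprod(pca$rotation)` would return the identity matrix up to floating-point error.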
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
Loadings are direction vectors that define each PC. Large absolute loadings indicate a strong contribution, and the signs indicate the direction in which the variables move relative to each other. In this case, all the PC1 loadings are positive, indicating that all the variables move in the same direction together. Esteem87_6 (0.347) and Esteem87_2 (0.333) have the largest loadings. Looking at PC2, we see that the loadings for Esteem87_1, Esteem87_2, Esteem87_4 and Esteem87_6 are negative whilst the remaining are positive, indicating that these two sets of variables move in contrasting directions. Esteem87_9 (0.4917), Esteem87_1 (-0.4452) and Esteem87_2 (-0.4283) have the largest loadings in absolute value.
c) How is the PC1 score obtained for each subject? Write down the formula.
The PC1 score for subject i is obtained as a linear combination of the standardised Esteem87 variables, each weighted by its corresponding loading. In this case, the PC1 score is obtained using the formula: $\text{PC1}_i = 0.324 Z_{i1} + 0.333 Z_{i2} + 0.322 Z_{i3} + \dots + 0.318 Z_{i,10}$, where $Z_{ij}$ is subject i's standardised score on Esteem87_j.
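The formula can be checked numerically. The sketch below uses simulated data (not the NLSY79 items) and assumes the PCA was run on standardised variables, as the formula states; it compares a hand-computed PC1 score with the one returned by `prcomp`.

```r
set.seed(1)

# Simulated stand-in for the 10 esteem items: one shared factor plus noise
latent <- rnorm(100)
X <- sapply(1:10, function(j) latent + rnorm(100))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Standardise each column, then weight by the PC1 loadings
Z <- scale(X)
pc1_manual <- as.vector(Z %*% pca$rotation[, 1])

# Largest absolute difference from the scores stored by prcomp
max_diff <- max(abs(pc1_manual - pca$x[, 1]))
```

If the PCA were instead run with `scale. = FALSE`, the same check would use `scale(X, scale = FALSE)` so that the columns are only centred.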
d) Are PC1 scores and PC2 scores in the data uncorrelated?
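Yes, the PC1 and PC2 scores are uncorrelated. The loading vectors are orthogonal eigenvectors of the sample covariance (or correlation) matrix, which makes the sample covariance between any two sets of PC scores exactly zero.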
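A quick numerical check of this property, again on simulated data rather than the actual esteem items:

```r
set.seed(2)

# Simulated correlated items: one shared factor plus noise
latent <- rnorm(150)
X <- sapply(1:10, function(j) latent + rnorm(150))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Correlation between the first two score vectors
score_cor <- cor(pca$x[, 1], pca$x[, 2])

# Covariance matrix of the scores: diagonal holds the PC variances
score_cov <- cov(pca$x[, 1:2])
```

The off-diagonal entries of `score_cov` are zero up to floating-point error, and the diagonal entries equal `pca$sdev[1:2]^2`.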
e) Plot PVE (Proportion of Variance Explained) and summarize the plot.
<img src="hw2_sp2026_files/figure-html/unnamed-chunk-7-1.png" width="768" />
f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
<img src="hw2_sp2026_files/figure-html/unnamed-chunk-8-1.png" width="768" />
From this, we can see that 60% of the variance is explained by the first two principal components.
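For reference, PVE and CPVE can be computed from the standard deviations stored in a `prcomp` object. The sketch below uses simulated data, so its numbers will not match the 60% reported here.

```r
set.seed(3)

# Simulated correlated items: one shared factor plus noise
latent <- rnorm(200)
X <- sapply(1:10, function(j) latent + rnorm(200))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion; cpve[2] is the share explained by PC1 and PC2
cpve <- cumsum(pve)
```

Calling `plot(pve, type = "b")` and `plot(cpve, type = "b")` would then draw the two scree-style plots.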
g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data. Give an interpretation of PC1 and PC2 from the plot.
From this, we can see that all loadings for PC1 (x-axis) point in the positive direction, so a higher PC1 score corresponds to higher self-esteem across all variables. Since no loading dominates the others, PC1 is essentially an average of all the items and, as such, can be interpreted as a general level of self-esteem.
Conversely, looking at PC2 (y-axis), we see that the loadings go in both the positive and negative directions, breaking the variables into two different groups. This likely reflects the different effects of positively-worded and negatively-worded questions. With the exception of Esteem87_7 ("I am satisfied with myself"), all of the positively-worded questions load negatively on PC2, whereas the negatively-worded questions load positively.
Apply k-means to cluster subjects on the original esteem scores
Looking at the Total Within-Cluster Sum of Squares, we can identify an elbow at 3 clusters.
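The elbow curve can be produced by running k-means for a range of k and recording `tot.withinss`. The sketch below uses simulated data with three well-separated groups, purely to illustrate the mechanics; the range `1:8` and `nstart = 20` are arbitrary choices.

```r
set.seed(123)

# Simulated data with three well-separated groups (50 points each, 2 columns)
X <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) {
  kmeans(X, centers = k, nstart = 20)$tot.withinss
})
```

Plotting `wss` against `1:8` shows a sharp drop up to k = 3 and only small gains afterwards, which is the elbow pattern referred to above.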
b) Can you summarize common features within each cluster?
## [1] 843 697 891
## PC1 PC2
## 1 -2.3627 0.600
## 2 -0.0473 -1.130
## 3 2.2725 0.316
Cluster 1 contains 843 observations and is centred at (-2.3627, 0.6000); Cluster 2 contains 697 observations and is centred at (-0.0473, -1.1300); and Cluster 3 contains 891 observations and is centred at (2.2725, 0.3160).
Going off my interpretation of PC1 as overall self-esteem, we can classify Cluster 1 as low self-esteem, Cluster 2 as average self-esteem and Cluster 3 as high self-esteem. When examining PC2 above, we suggested that it may capture the differing tone in which the questions were framed (positive and negative). Clusters 1 and 3 contain mostly positive PC2 values with some negative values, whereas Cluster 2 contains mostly negative values. That interpretation does not carry over to these clusters, because there are three clusters rather than two and each cluster contains a range of positive and negative values. The clustering therefore appears to pick up some factor that affects PC2, but we were unable to find a clear interpretation for it.
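Because k-means numbers its clusters arbitrarily, one way to attach the low / average / high labels is to order the clusters by their PC1 centre. The sketch below does this on simulated PC scores whose group centres loosely mimic the ones reported above; the simulated spread and the name `esteem_level` are assumptions made for illustration.

```r
set.seed(10)

# Simulated PC scores for three groups (not the real data)
scores <- rbind(
  cbind(PC1 = rnorm(60, -2.4, 0.5), PC2 = rnorm(60,  0.6, 0.5)),
  cbind(PC1 = rnorm(50,  0.0, 0.5), PC2 = rnorm(50, -1.1, 0.5)),
  cbind(PC1 = rnorm(70,  2.3, 0.5), PC2 = rnorm(70,  0.3, 0.5))
)

km <- kmeans(scores, centers = 3, nstart = 20)

# Rank clusters by their PC1 centre: lowest centre gets the label "low"
ord    <- order(km$centers[, "PC1"])
labels <- c("low", "average", "high")
esteem_level <- factor(labels[match(km$cluster, ord)], levels = labels)

cluster_sizes <- table(esteem_level)
pc1_means     <- tapply(scores[, "PC1"], esteem_level, mean)
```

Labelling by centre keeps the interpretation stable even if a rerun assigns different cluster numbers.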
c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
Note, in this case, we have chosen to only cluster around PCs due to the potential for multicollinearity between variables in the data and the presence of unwanted noise.
## [1] 799 776 856
## PC1 PC3
## 1 -0.00167 -0.704
## 2 -2.56519 0.339
## 3 2.32701 0.350
## [1] 605 1060 766
## PC2 PC3
## 1 0.638 -0.98529
## 2 0.560 0.55887
## 3 -1.279 0.00482
We now try to find out which factors are related to self-esteem. PC1 of all the Esteem scores is a good variable to summarize one’s esteem scores, so we take PC1 as our response variable.
Prepare possible factors/variables:
Firstly, we have conducted PCA on the ASVAB dataset, extracting PC1 scores and adding them to the dataset as a general level of intelligence.
Next, we will create a BMI variable to summarise an individual’s body height and weight.
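A minimal sketch of the BMI calculation, assuming the usual imperial formula (703 times weight in pounds divided by height in inches squared). The formula and the helper name `bmi` are assumptions, although the formula reproduces the first BMI values shown for this dataset (29.3, 31.1, 25.8, 43.6).

```r
# BMI from weight in pounds and height given as feet plus inches
# (assumed formula: 703 * lb / in^2)
bmi <- function(weight_lb, height_ft, height_in) {
  total_in <- 12 * height_ft + height_in
  703 * weight_lb / total_in^2
}

# First four subjects: Weight05, HeightFeet05, HeightInch05
weight <- c(160, 187, 175, 246)
feet   <- c(5, 5, 5, 5)
inches <- c(2, 5, 9, 3)

bmi_values <- round(bmi(weight, feet, inches), 1)
```

Since the height summary includes a negative minimum for HeightFeet05, implausible heights are worth checking before trusting extreme BMI values.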
Finally, we are going to remove the unwanted variables from the dataset (specifically, the Esteem81 scores and the ASVAB test scores, including AFQT). The primary reason is that these variables are either already summarised by other variables such as Intelligence or Esteem_PC1, or are not needed (like Esteem81).
## 'data.frame': 2431 obs. of 15 variables:
## $ Subject : int 2 6 7 8 9 13 16 17 18 20 ...
## $ Gender : chr "female" "male" "male" "female" ...
## $ Education05 : int 12 16 12 14 14 16 13 13 13 17 ...
## $ Income87 : int 16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
## $ Job05 : chr "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
## $ Income05 : int 5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
## $ Imagazine : int 1 0 1 1 1 1 1 1 1 1 ...
## $ Inewspaper : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ilibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MotherEd : int 5 12 12 9 12 12 12 12 12 12 ...
## $ FatherEd : int 8 12 12 6 10 16 12 15 16 18 ...
## $ FamilyIncome78: int 20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
## $ Esteem_PC1 : num -0.54 1.4 -0.38 -0.62 3.07 ...
## $ Intelligence : num -4.366 4.545 -1.603 -0.872 0.312 ...
## $ BMI : num 29.3 31.1 25.8 43.6 29.1 ...
Following the data preparation, we will conduct some EDA to gain a sense of the structure of the final dataset as well as the distribution of each variable.
## Subject Gender Education05 Income87
## Min. : 2 Length:2431 Min. : 6.0 Min. : -2
## 1st Qu.: 1592 Class :character 1st Qu.:12.0 1st Qu.: 4500
## Median : 3137 Mode :character Median :13.0 Median :12000
## Mean : 3504 Mean :13.9 Mean :13399
## 3rd Qu.: 4668 3rd Qu.:16.0 3rd Qu.:19000
## Max. :12140 Max. :20.0 Max. :59387
## Job05 Income05 Imagazine Inewspaper
## Length:2431 Min. : 63 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 22650 1st Qu.:0.000 1st Qu.:1.000
## Mode :character Median : 38500 Median :1.000 Median :1.000
## Mean : 49415 Mean :0.718 Mean :0.861
## 3rd Qu.: 61350 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :703637 Max. :1.000 Max. :1.000
## Ilibrary MotherEd FatherEd FamilyIncome78 Esteem_PC1
## Min. :0.00 Min. : 0.0 Min. : 0.0 Min. : 0 Min. :-9.0499
## 1st Qu.:1.00 1st Qu.:11.0 1st Qu.:10.0 1st Qu.:11167 1st Qu.:-1.8806
## Median :1.00 Median :12.0 Median :12.0 Median :20000 Median : 0.0669
## Mean :0.77 Mean :11.7 Mean :11.9 Mean :21252 Mean : 0.0000
## 3rd Qu.:1.00 3rd Qu.:12.0 3rd Qu.:14.0 3rd Qu.:27500 3rd Qu.: 1.9200
## Max. :1.00 Max. :20.0 Max. :20.0 Max. :75001 Max. : 3.0734
## Intelligence BMI
## Min. :-9.667 Min. : 11.9
## 1st Qu.:-1.849 1st Qu.: 24.1
## Median : 0.342 Median : 27.3
## Mean : 0.000 Mean : 28.1
## 3rd Qu.: 2.120 3rd Qu.: 30.9
## Max. : 5.026 Max. :169.5
b) Run a few regression models between PC1 of all the esteem scores in 87 and suitable variables listed in a). Find a final best model with your **own clearly defined criterion**.
We will conduct both forward and backward stepwise multiple linear regression and choose the model which minimises MSE and maximises R-squared. Esteem_PC1 will be our dependent variable, and the remaining factors our independent variables.
We first conduct a backward stepwise regression, which starts with all the variables and removes the least significant one at each step until removing more variables no longer improves the performance of the model.
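One common way to run backward elimination in R is `step()`, sketched below on simulated data. Note that `step()` selects by AIC, which may differ from the exact procedure used for the output that follows; the data frame here, including the `Noise` column, is hypothetical.

```r
set.seed(7)
n <- 300

# Hypothetical data: two real predictors and one irrelevant one
temp <- data.frame(
  Education05  = sample(8:20, n, replace = TRUE),
  Intelligence = rnorm(n),
  Noise        = rnorm(n)
)
temp$Esteem_PC1 <- 0.1 * temp$Education05 + 0.8 * temp$Intelligence + rnorm(n)

# Start from the full model and drop terms while AIC keeps improving
full <- lm(Esteem_PC1 ~ ., data = temp)
back <- step(full, direction = "backward", trace = 0)

kept <- attr(terms(back), "term.labels")   # predictors that survived
```

`summary(back)` then prints the coefficient table for the retained predictors in the same layout as the output shown here.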
##
## Call:
## lm(formula = Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 +
## Inewspaper + Ilibrary + Intelligence, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.393 -1.558 0.002 1.672 5.136
##
## Coefficients:
## Estimate
## (Intercept) -2.20e+00
## Education05 7.74e-02
## Income87 1.30e-05
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 5.42e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 6.82e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -1.27e-01
## Job051600 TO 1760: Physical Scientists -1.52e+00
## Job051800 TO 1860: Social Scientists and Related Workers -6.00e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 2.37e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 3.12e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 1.24e-01
## Job052200 TO 2340: Teachers 4.28e-01
## Job052400 TO 2550: Education, Training and Library Workers 4.43e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 1.24e+00
## Job052800 TO 2960: Media and Communications Workers 5.16e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 6.32e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations -1.80e-01
## Job053700 TO 3950: Protective Service Occupations 1.03e+00
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -1.68e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations -1.67e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.14e+00
## Job054500 TO 4650: Personal Care and Service Workers 5.29e-01
## Job054700 TO 4960: Sales and Related Workers 4.31e-01
## Job05500 TO 950: Management Related Occupations 8.43e-01
## Job055000 TO 5930: Office and Administrative Support Workers 4.92e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 1.36e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 6.85e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 2.50e-01
## Job057700 TO 7750: Production and Operating Workers 2.51e-01
## Job057800 TO 7850: Food Preparation Occupations 4.59e-01
## Job057900 TO 8960: Setters, Operators and Tenders 2.94e-01
## Job059000 TO 9750: Transportation and Material Moving Workers -1.42e-01
## Job059990: Uncodeable 2.02e-02
## Income05 4.61e-06
## Inewspaper 2.99e-01
## Ilibrary 1.44e-01
## Intelligence 1.24e-01
## Std. Error
## (Intercept) 4.27e-01
## Education05 2.29e-02
## Income87 3.87e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 2.91e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 3.72e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 3.88e-01
## Job051600 TO 1760: Physical Scientists 1.04e+00
## Job051800 TO 1860: Social Scientists and Related Workers 8.67e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 8.08e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 4.18e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 5.93e-01
## Job052200 TO 2340: Teachers 3.35e-01
## Job052400 TO 2550: Education, Training and Library Workers 4.62e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 4.93e-01
## Job052800 TO 2960: Media and Communications Workers 6.21e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 3.61e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations 3.37e-01
## Job053700 TO 3950: Protective Service Occupations 3.84e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 3.65e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations 3.66e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers 6.91e-01
## Job054500 TO 4650: Personal Care and Service Workers 4.12e-01
## Job054700 TO 4960: Sales and Related Workers 3.04e-01
## Job05500 TO 950: Management Related Occupations 3.33e-01
## Job055000 TO 5930: Office and Administrative Support Workers 2.90e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 7.26e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 3.21e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 3.33e-01
## Job057700 TO 7750: Production and Operating Workers 3.94e-01
## Job057800 TO 7850: Food Preparation Occupations 1.04e+00
## Job057900 TO 8960: Setters, Operators and Tenders 3.31e-01
## Job059000 TO 9750: Transportation and Material Moving Workers 3.28e-01
## Job059990: Uncodeable 2.03e+00
## Income05 1.04e-06
## Inewspaper 1.27e-01
## Ilibrary 1.02e-01
## Intelligence 2.04e-02
## t value
## (Intercept) -5.14
## Education05 3.38
## Income87 3.36
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 1.86
## Job051000 TO 1240: Mathematical and Computer Scientists 1.83
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -0.33
## Job051600 TO 1760: Physical Scientists -1.45
## Job051800 TO 1860: Social Scientists and Related Workers -0.69
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.29
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.75
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.21
## Job052200 TO 2340: Teachers 1.28
## Job052400 TO 2550: Education, Training and Library Workers 0.96
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.52
## Job052800 TO 2960: Media and Communications Workers 0.83
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 1.75
## Job053300 TO 3650: Health Care Technical and Support Occupations -0.53
## Job053700 TO 3950: Protective Service Occupations 2.68
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -0.46
## Job054200 TO 4250: Cleaning and Building Service Occupations -0.46
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.65
## Job054500 TO 4650: Personal Care and Service Workers 1.29
## Job054700 TO 4960: Sales and Related Workers 1.42
## Job05500 TO 950: Management Related Occupations 2.53
## Job055000 TO 5930: Office and Administrative Support Workers 1.70
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.19
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.21
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.75
## Job057700 TO 7750: Production and Operating Workers 0.64
## Job057800 TO 7850: Food Preparation Occupations 0.44
## Job057900 TO 8960: Setters, Operators and Tenders 0.89
## Job059000 TO 9750: Transportation and Material Moving Workers -0.43
## Job059990: Uncodeable 0.01
## Income05 4.43
## Inewspaper 2.34
## Ilibrary 1.41
## Intelligence 6.08
## Pr(>|t|)
## (Intercept) 3.0e-07
## Education05 0.00075
## Income87 0.00079
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 0.06237
## Job051000 TO 1240: Mathematical and Computer Scientists 0.06678
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 0.74262
## Job051600 TO 1760: Physical Scientists 0.14623
## Job051800 TO 1860: Social Scientists and Related Workers 0.48877
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.76896
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.45472
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.83490
## Job052200 TO 2340: Teachers 0.20125
## Job052400 TO 2550: Education, Training and Library Workers 0.33756
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 0.01188
## Job052800 TO 2960: Media and Communications Workers 0.40623
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 0.08001
## Job053300 TO 3650: Health Care Technical and Support Occupations 0.59297
## Job053700 TO 3950: Protective Service Occupations 0.00751
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 0.64525
## Job054200 TO 4250: Cleaning and Building Service Occupations 0.64848
## Job054300 TO 4430: Entertainment Attendants and Related Workers 0.09992
## Job054500 TO 4650: Personal Care and Service Workers 0.19870
## Job054700 TO 4960: Sales and Related Workers 0.15605
## Job05500 TO 950: Management Related Occupations 0.01151
## Job055000 TO 5930: Office and Administrative Support Workers 0.08928
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.85173
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.83113
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.45227
## Job057700 TO 7750: Production and Operating Workers 0.52489
## Job057800 TO 7850: Food Preparation Occupations 0.66006
## Job057900 TO 8960: Setters, Operators and Tenders 0.37412
## Job059000 TO 9750: Transportation and Material Moving Workers 0.66501
## Job059990: Uncodeable 0.99207
## Income05 9.7e-06
## Inewspaper 0.01918
## Ilibrary 0.15800
## Intelligence 1.4e-09
##
## (Intercept) ***
## Education05 ***
## Income87 ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations .
## Job051000 TO 1240: Mathematical and Computer Scientists .
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians
## Job051600 TO 1760: Physical Scientists
## Job051800 TO 1860: Social Scientists and Related Workers
## Job051900 TO 1960: Life, Physical and Social Science Technicians
## Job052000 TO 2060: Counselors, Sociala and Religious Workers
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers
## Job052200 TO 2340: Teachers
## Job052400 TO 2550: Education, Training and Library Workers
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers *
## Job052800 TO 2960: Media and Communications Workers
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners .
## Job053300 TO 3650: Health Care Technical and Support Occupations
## Job053700 TO 3950: Protective Service Occupations **
## Job054000 TO 4160: Food Preparation and Serving Related Occupations
## Job054200 TO 4250: Cleaning and Building Service Occupations
## Job054300 TO 4430: Entertainment Attendants and Related Workers .
## Job054500 TO 4650: Personal Care and Service Workers
## Job054700 TO 4960: Sales and Related Workers
## Job05500 TO 950: Management Related Occupations *
## Job055000 TO 5930: Office and Administrative Support Workers .
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations
## Job056200 TO 6940: Construction Trade and Extraction Workers
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers
## Job057700 TO 7750: Production and Operating Workers
## Job057800 TO 7850: Food Preparation Occupations
## Job057900 TO 8960: Setters, Operators and Tenders
## Job059000 TO 9750: Transportation and Material Moving Workers
## Job059990: Uncodeable
## Income05 ***
## Inewspaper *
## Ilibrary
## Intelligence ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.01 on 2394 degrees of freedom
## Multiple R-squared: 0.15, Adjusted R-squared: 0.137
## F-statistic: 11.7 on 36 and 2394 DF, p-value: <2e-16
## Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 + Inewspaper +
## Ilibrary + Intelligence
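This process selected the model Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 + Inewspaper + Ilibrary + Intelligence. Showing only the terms that are significant at the 0.05 level, the fitted equation is: Esteem_PC1 = -2.20 + 0.0774(Education05) + 0.000013(Income87) + 1.24(Entertainers and Performers, Sports and Related Workers) + 1.03(Protective Service Occupations) + 0.843(Management Related Occupations) + 0.00000461(Income05) + 0.299(Inewspaper) + 0.124(Intelligence). Job05 enters as a single categorical variable, so all of its levels remain in the fitted model, as does Ilibrary (p = 0.158); the equation simply highlights the occupations whose coefficients differ significantly from the baseline job.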
This model has an R-squared of 0.15, meaning that about 15% of the variation in Esteem_PC1 is explained by the independent variables (adjusted R-squared: 0.137). Furthermore, the F-statistic of 11.7 on 36 and 2394 degrees of freedom has a p-value less than 2x10^-16, indicating that the model as a whole explains significantly more variation in the dependent variable than an intercept-only model. Finally, the residual standard error is 2.01, meaning that observed values of Esteem_PC1 typically deviate from the fitted values by about 2.01 units on the Esteem_PC1 scale.
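To make the three summary statistics concrete, the sketch below recomputes them by hand and compares them with what `summary()` reports. It uses the built-in `mtcars` data rather than the NLSY79 data, so the numbers are unrelated to the output above; only the formulas carry over.

```r
# Illustration on built-in data (assumption: mtcars stands in for the report's data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
df_resid <- df.residual(fit)                   # n minus number of coefficients
p <- length(coef(fit)) - 1                     # number of slope terms

r_squared <- 1 - rss / tss                     # share of variation explained
rse <- sqrt(rss / df_resid)                    # typical residual size, in units of mpg
f_stat <- ((tss - rss) / p) / (rss / df_resid) # overall F-statistic

s <- summary(fit)
rbind(
  by_hand = c(r_squared, rse, f_stat),
  summary = c(s$r.squared, s$sigma, unname(s$fstatistic["value"]))
)
```

Because the residual standard error is the square root of RSS divided by the residual degrees of freedom, it is expressed in the units of the response, not in standard deviations.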
Next, we will conduct a forward step-wise regression and compare the effectiveness of the model.
##
## Call:
## lm(formula = Esteem_PC1 ~ Intelligence + Income05 + Education05 +
## Income87 + Inewspaper + Job05 + Ilibrary, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.393 -1.558 0.002 1.672 5.136
##
## Coefficients:
## Estimate
## (Intercept) -2.20e+00
## Intelligence 1.24e-01
## Income05 4.61e-06
## Education05 7.74e-02
## Income87 1.30e-05
## Inewspaper 2.99e-01
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 5.42e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 6.82e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -1.27e-01
## Job051600 TO 1760: Physical Scientists -1.52e+00
## Job051800 TO 1860: Social Scientists and Related Workers -6.00e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 2.37e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 3.12e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 1.24e-01
## Job052200 TO 2340: Teachers 4.28e-01
## Job052400 TO 2550: Education, Training and Library Workers 4.43e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 1.24e+00
## Job052800 TO 2960: Media and Communications Workers 5.16e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 6.32e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations -1.80e-01
## Job053700 TO 3950: Protective Service Occupations 1.03e+00
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -1.68e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations -1.67e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.14e+00
## Job054500 TO 4650: Personal Care and Service Workers 5.29e-01
## Job054700 TO 4960: Sales and Related Workers 4.31e-01
## Job05500 TO 950: Management Related Occupations 8.43e-01
## Job055000 TO 5930: Office and Administrative Support Workers 4.92e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 1.36e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 6.85e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 2.50e-01
## Job057700 TO 7750: Production and Operating Workers 2.51e-01
## Job057800 TO 7850: Food Preparation Occupations 4.59e-01
## Job057900 TO 8960: Setters, Operators and Tenders 2.94e-01
## Job059000 TO 9750: Transportation and Material Moving Workers -1.42e-01
## Job059990: Uncodeable 2.02e-02
## Ilibrary 1.44e-01
## Std. Error
## (Intercept) 4.27e-01
## Intelligence 2.04e-02
## Income05 1.04e-06
## Education05 2.29e-02
## Income87 3.87e-06
## Inewspaper 1.27e-01
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 2.91e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 3.72e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 3.88e-01
## Job051600 TO 1760: Physical Scientists 1.04e+00
## Job051800 TO 1860: Social Scientists and Related Workers 8.67e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 8.08e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 4.18e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 5.93e-01
## Job052200 TO 2340: Teachers 3.35e-01
## Job052400 TO 2550: Education, Training and Library Workers 4.62e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 4.93e-01
## Job052800 TO 2960: Media and Communications Workers 6.21e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 3.61e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations 3.37e-01
## Job053700 TO 3950: Protective Service Occupations 3.84e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 3.65e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations 3.66e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers 6.91e-01
## Job054500 TO 4650: Personal Care and Service Workers 4.12e-01
## Job054700 TO 4960: Sales and Related Workers 3.04e-01
## Job05500 TO 950: Management Related Occupations 3.33e-01
## Job055000 TO 5930: Office and Administrative Support Workers 2.90e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 7.26e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 3.21e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 3.33e-01
## Job057700 TO 7750: Production and Operating Workers 3.94e-01
## Job057800 TO 7850: Food Preparation Occupations 1.04e+00
## Job057900 TO 8960: Setters, Operators and Tenders 3.31e-01
## Job059000 TO 9750: Transportation and Material Moving Workers 3.28e-01
## Job059990: Uncodeable 2.03e+00
## Ilibrary 1.02e-01
## t value
## (Intercept) -5.14
## Intelligence 6.08
## Income05 4.43
## Education05 3.38
## Income87 3.36
## Inewspaper 2.34
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 1.86
## Job051000 TO 1240: Mathematical and Computer Scientists 1.83
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -0.33
## Job051600 TO 1760: Physical Scientists -1.45
## Job051800 TO 1860: Social Scientists and Related Workers -0.69
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.29
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.75
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.21
## Job052200 TO 2340: Teachers 1.28
## Job052400 TO 2550: Education, Training and Library Workers 0.96
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.52
## Job052800 TO 2960: Media and Communications Workers 0.83
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 1.75
## Job053300 TO 3650: Health Care Technical and Support Occupations -0.53
## Job053700 TO 3950: Protective Service Occupations 2.68
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -0.46
## Job054200 TO 4250: Cleaning and Building Service Occupations -0.46
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.65
## Job054500 TO 4650: Personal Care and Service Workers 1.29
## Job054700 TO 4960: Sales and Related Workers 1.42
## Job05500 TO 950: Management Related Occupations 2.53
## Job055000 TO 5930: Office and Administrative Support Workers 1.70
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.19
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.21
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.75
## Job057700 TO 7750: Production and Operating Workers 0.64
## Job057800 TO 7850: Food Preparation Occupations 0.44
## Job057900 TO 8960: Setters, Operators and Tenders 0.89
## Job059000 TO 9750: Transportation and Material Moving Workers -0.43
## Job059990: Uncodeable 0.01
## Ilibrary 1.41
## Pr(>|t|)
## (Intercept) 3.0e-07
## Intelligence 1.4e-09
## Income05 9.7e-06
## Education05 0.00075
## Income87 0.00079
## Inewspaper 0.01918
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 0.06237
## Job051000 TO 1240: Mathematical and Computer Scientists 0.06678
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 0.74262
## Job051600 TO 1760: Physical Scientists 0.14623
## Job051800 TO 1860: Social Scientists and Related Workers 0.48877
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.76896
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.45472
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.83490
## Job052200 TO 2340: Teachers 0.20125
## Job052400 TO 2550: Education, Training and Library Workers 0.33756
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 0.01188
## Job052800 TO 2960: Media and Communications Workers 0.40623
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 0.08001
## Job053300 TO 3650: Health Care Technical and Support Occupations 0.59297
## Job053700 TO 3950: Protective Service Occupations 0.00751
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 0.64525
## Job054200 TO 4250: Cleaning and Building Service Occupations 0.64848
## Job054300 TO 4430: Entertainment Attendants and Related Workers 0.09992
## Job054500 TO 4650: Personal Care and Service Workers 0.19870
## Job054700 TO 4960: Sales and Related Workers 0.15605
## Job05500 TO 950: Management Related Occupations 0.01151
## Job055000 TO 5930: Office and Administrative Support Workers 0.08928
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.85173
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.83113
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.45227
## Job057700 TO 7750: Production and Operating Workers 0.52489
## Job057800 TO 7850: Food Preparation Occupations 0.66006
## Job057900 TO 8960: Setters, Operators and Tenders 0.37412
## Job059000 TO 9750: Transportation and Material Moving Workers 0.66501
## Job059990: Uncodeable 0.99207
## Ilibrary 0.15800
##
## (Intercept) ***
## Intelligence ***
## Income05 ***
## Education05 ***
## Income87 ***
## Inewspaper *
## Job0510 TO 430: Executive, Administrative and Managerial Occupations .
## Job051000 TO 1240: Mathematical and Computer Scientists .
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians
## Job051600 TO 1760: Physical Scientists
## Job051800 TO 1860: Social Scientists and Related Workers
## Job051900 TO 1960: Life, Physical and Social Science Technicians
## Job052000 TO 2060: Counselors, Sociala and Religious Workers
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers
## Job052200 TO 2340: Teachers
## Job052400 TO 2550: Education, Training and Library Workers
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers *
## Job052800 TO 2960: Media and Communications Workers
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners .
## Job053300 TO 3650: Health Care Technical and Support Occupations
## Job053700 TO 3950: Protective Service Occupations **
## Job054000 TO 4160: Food Preparation and Serving Related Occupations
## Job054200 TO 4250: Cleaning and Building Service Occupations
## Job054300 TO 4430: Entertainment Attendants and Related Workers .
## Job054500 TO 4650: Personal Care and Service Workers
## Job054700 TO 4960: Sales and Related Workers
## Job05500 TO 950: Management Related Occupations *
## Job055000 TO 5930: Office and Administrative Support Workers .
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations
## Job056200 TO 6940: Construction Trade and Extraction Workers
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers
## Job057700 TO 7750: Production and Operating Workers
## Job057800 TO 7850: Food Preparation Occupations
## Job057900 TO 8960: Setters, Operators and Tenders
## Job059000 TO 9750: Transportation and Material Moving Workers
## Job059990: Uncodeable
## Ilibrary
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.01 on 2394 degrees of freedom
## Multiple R-squared: 0.15, Adjusted R-squared: 0.137
## F-statistic: 11.7 on 36 and 2394 DF, p-value: <2e-16
## Esteem_PC1 ~ Intelligence + Income05 + Education05 + Income87 +
## Inewspaper + Job05 + Ilibrary
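This process arrived at exactly the same model: Esteem_PC1 ~ Intelligence + Income05 + Education05 + Income87 + Inewspaper + Job05 + Ilibrary. Showing only the terms that are significant at the 0.05 level, the fitted equation is: Esteem_PC1 = -2.20 + 0.0774(Education05) + 0.000013(Income87) + 1.24(Entertainers and Performers, Sports and Related Workers) + 1.03(Protective Service Occupations) + 0.843(Management Related Occupations) + 0.00000461(Income05) + 0.299(Inewspaper) + 0.124(Intelligence). Job05 enters as a single categorical variable, so all of its levels remain in the fitted model, as does Ilibrary (0.144, p = 0.158); the equation simply highlights the occupations whose coefficients differ significantly from the baseline job.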
Because the model is identical, it again has an R-squared of 0.15, meaning that about 15% of the variation in Esteem_PC1 is explained by the independent variables. The F-statistic of 11.7 has a p-value less than 2x10^-16, indicating that the overall model is significant in explaining variation in the dependent variable. Finally, the residual standard error is 2.01, meaning that observed values of Esteem_PC1 typically deviate from the fitted values by about 2.01 units on the Esteem_PC1 scale.
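The selection code itself is not shown in this report, so the following is only a minimal sketch of how backward and forward stepwise selection by AIC can be run with `stats::step()`. It uses the built-in `mtcars` data and an arbitrary set of candidate predictors as stand-ins.

```r
# Assumption: mtcars and these five predictors are placeholders, not the report's variables.
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward: start from the full model and drop terms while AIC improves.
backward <- step(full, direction = "backward", trace = 0)

# Forward: start from the intercept-only model and add terms while AIC improves.
forward <- step(
  null,
  scope = list(lower = formula(null), upper = formula(full)),
  direction = "forward",
  trace = 0
)

formula(backward)
formula(forward)

# The two directions need not agree in general; this checks whether they do here.
same_terms <- setequal(
  attr(terms(backward), "term.labels"),
  attr(terms(forward), "term.labels")
)
same_terms
```

Note that `step()` adds or drops a factor as a whole, which is why every level of a categorical predictor such as Job05 stays in the model once the variable is selected.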
Finally, we will conduct an exhaustive search, which involves fitting a regression model for every possible combination of independent variables and selecting the one with the lowest AIC.
##
## Call:
## lm(formula = best_formula, data = df_exh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.910 -1.617 0.034 1.680 4.791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.38e+00 3.27e-01 -7.29 4.3e-13 ***
## Education05 9.46e-02 2.08e-02 4.55 5.7e-06 ***
## Income87 1.37e-05 3.85e-06 3.56 0.00038 ***
## Income05 4.81e-06 1.00e-06 4.81 1.6e-06 ***
## Inewspaper 2.50e-01 1.29e-01 1.94 0.05193 .
## Ilibrary 1.56e-01 1.03e-01 1.52 0.12921
## MotherEd 2.65e-02 1.86e-02 1.42 0.15461
## Intelligence 1.28e-01 2.05e-02 6.21 6.2e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.03 on 2423 degrees of freedom
## Multiple R-squared: 0.128, Adjusted R-squared: 0.126
## F-statistic: 50.9 on 7 and 2423 DF, p-value: <2e-16
This process found that the best model was: Esteem_PC1 = -2.38 + 0.0946(Education05) + 0.0000137(Income87) + 0.00000481(Income05) + 0.25(Inewspaper) + 0.156(Ilibrary) + 0.0265(MotherEd) + 0.128(Intelligence).
This model has an R-squared of 0.128, meaning that about 12.8% of the variation in Esteem_PC1 is explained by the independent variables. Furthermore, the F-statistic of 50.9 on 7 and 2423 degrees of freedom has a p-value less than 2x10^-16, indicating that the overall model is significant in explaining variation in the dependent variable. Finally, the residual standard error is 2.03, meaning that observed values of Esteem_PC1 typically deviate from the fitted values by about 2.03 units on the Esteem_PC1 scale.
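The code behind the exhaustive search is not included in this report. As a hedged illustration of the idea, the sketch below enumerates every non-empty subset of a small candidate set, fits each model, and keeps the one with the lowest AIC. The data (`mtcars`) and the candidate names are placeholders; the name `best_formula` merely echoes the call shown in the output above.

```r
# Assumption: mtcars and these candidates stand in for the report's data frame.
candidates <- c("wt", "hp", "disp", "qsec", "drat")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates), function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its AIC.
aics <- vapply(
  subsets,
  function(vars) AIC(lm(reformulate(vars, response = "mpg"), data = mtcars)),
  numeric(1)
)

best_vars <- subsets[[which.min(aics)]]
best_formula <- reformulate(best_vars, response = "mpg")
best_fit <- lm(best_formula, data = mtcars)

length(subsets)  # number of models fitted: 2^5 - 1
best_formula
```

The number of models doubles with each extra candidate, so this brute-force approach is only practical for a modest number of predictors.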
Hence, comparing these models, we select the forward / backward stepwise regression model because it has a higher R-squared (0.15 vs. 0.128) and a higher adjusted R-squared (0.137 vs. 0.126), while being similar to the exhaustive-search model in residual standard error (2.01 vs. 2.03) and in the significance of the overall F-test.
To test the normality assumption, we look at the QQ-plot. The points lie very close to the diagonal line, with only slight curvature at the lower and upper tails, so we can assume that the normality assumption is met. For linearity, we look for an even, random scatter of points above and below zero in the Residuals vs. Fitted plot. The points are evenly distributed for fitted values below -1; beyond that, however, there is a clear linear decrease, with the points converging toward zero as the fitted values become more positive. As such, we cannot conclude that the linearity assumption holds. Finally, for homoskedasticity, we look at the vertical spread of points in the Residuals vs. Fitted plot. The spread is not even, because it narrows toward zero as the fitted values become more positive. Consequently, this assumption does not hold either.
Thus, the normality assumption holds, but the linearity and homoskedasticity assumptions do not hold.
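For reference, the two diagnostic plots discussed above can be produced from any fitted `lm` object with the base `plot()` method. The sketch below uses a stand-in model on `mtcars` (an assumption, since the plotting code is not shown in this report) and adds a Shapiro-Wilk test as a numeric companion to the QQ-plot.

```r
# Assumption: a placeholder model; substitute the selected model object.
fit <- lm(mpg ~ wt + hp, data = mtcars)

pdf(NULL)             # null device for non-interactive runs; remove to see the plots
par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted: linearity and equal variance
plot(fit, which = 2)  # Normal Q-Q: normality of the residuals
dev.off()

# Formal check of normality of the residuals
sw <- shapiro.test(residuals(fit))
sw$p.value
```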
Looking at the final model, we conclude that the variables most strongly associated with self-esteem are Education05; Income87; Entertainers and Performers, Sports and Related Workers; Protective Service Occupations; Inewspaper; Intelligence; Income05; and Management Related Occupations. Thus, holding all other independent variables constant, for every one-unit increase in:
- Education05, Self-Esteem will increase by 0.0774 on average.
- Income87, Self-Esteem will increase by 0.000013 on average.
- Income05, Self-Esteem will increase by 0.00000461 on average.
- Inewspaper, Self-Esteem will increase by 0.299 on average.
- Intelligence, Self-Esteem will increase by 0.124 on average.
Similarly, compared with participants in the baseline job category and holding the other variables constant, participants in these jobs had higher Self-Esteem on average by:
- 1.24 for Entertainers and Performers, Sports and Related Workers
- 1.03 for Protective Service Occupations
- 0.843 for Management Related Occupations
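The job coefficients above are differences from a baseline category, which is how `lm()` handles a factor under its default treatment contrasts. A small sketch on `mtcars` (a stand-in, with `cyl` playing the role of the job variable) shows where the baseline comes from and what changes when it is switched.

```r
# Assumption: cyl is a placeholder for a categorical predictor such as Job05.
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = cars)

levels(cars$cyl)[1]  # the first level is the baseline, absorbed into the intercept
coef(fit)            # cyl6 and cyl8 are differences from that baseline

# Changing the baseline changes the dummy coefficients but not the fitted values.
cars2 <- transform(cars, cyl = relevel(cyl, ref = "8"))
fit2 <- lm(mpg ~ wt + cyl, data = cars2)
coef(fit2)
```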
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to the public through the Genomic Data Commons Data Portal (GDC).
In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), and HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier. (We had hoped to download the data for control groups for each type of cancer, but failed to do so. Please let us know if you find the appropriate data.)
We will try to use mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels or prediction without outcomes is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-types can be separated through mRNA data.
We first read the data using data.table::fread() which
is a faster way to read in big data than read.csv().
Summary and transformation
How many patients are there in each sub-type?
Randomly pick 5 genes and plot the histogram by each sub-type.
Clean and transform the mRNA sequences by first removing genes with zero counts or no variability and then applying a logarithmic transform.
Apply kmeans on the transformed dataset with 4 centers (4
clusters) and output the discrepancy table between the real sub-type
brca_subtype and the cluster labels.
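As a hedged illustration of the cleaning and clustering steps described above, the sketch below runs them on a small synthetic count matrix, not on the BRCA data. The labels, the matrix sizes, and the `log2(x + 1)` pseudo-count are all assumptions made for the example.

```r
set.seed(1)
n <- 60  # synthetic "patients"
g <- 50  # synthetic "genes"
subtype <- rep(c("A", "B", "C", "D"), each = n / 4)  # hypothetical labels

# Each hypothetical subtype gets its own mean count per gene.
mu <- matrix(runif(4 * g, min = 5, max = 100), nrow = 4)
counts <- matrix(rpois(n * g, lambda = mu[as.integer(factor(subtype)), ]), nrow = n)
counts <- cbind(counts, 0, 7)  # add one all-zero gene and one constant gene

# Drop genes with zero total count or zero variance, then log-transform.
keep <- colSums(counts) > 0 & apply(counts, 2, var) > 0
x <- log2(counts[, keep] + 1)  # +1 avoids log(0); the pseudo-count is an assumption

# K-means with 4 centers and the cross-table of labels vs. clusters.
km <- kmeans(x, centers = 4, nstart = 20)
tab <- table(subtype, cluster = km$cluster)
tab
```

Cluster numbers are arbitrary, so the table should be read by looking for one dominant cell per row rather than for a diagonal.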
Spectrum clustering: to scale or not to scale?
Apply PCA on the centered and scaled dataset. How many PCs should
we use and why? You are encouraged to use irlba::irlba().
In order to do so please review the section about SVD in PCA
module.
Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering process? Why?
Spectrum clustering: center but do not scale the data
Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.
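A minimal sketch of this step on synthetic data follows. It uses base `prcomp()` instead of `irlba::irlba()` to stay dependency-free, and the three simulated groups are an assumption for illustration only.

```r
set.seed(1)
# Synthetic data with three well-separated groups (placeholder for the mRNA matrix).
x <- rbind(
  matrix(rnorm(30 * 20, mean = 0), ncol = 20),
  matrix(rnorm(30 * 20, mean = 3), ncol = 20),
  matrix(rnorm(30 * 20, mean = 6), ncol = 20)
)

# PCA on centered but unscaled data; keep the first 4 PC scores.
pca <- prcomp(x, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:4]

# Total within-cluster sum of squares for k = 1, ..., 8.
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(scores, centers = k, nstart = 20)$tot.withinss)
data.frame(k = ks, wss = round(wss, 1))

# plot(ks, wss, type = "b") draws the elbow curve; the "elbow" is the k after
# which further increases in k reduce wss only slightly.
```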
Choose an optimal number of clusters and apply kmeans. Compare the real sub-type and the clustering label as follows: draw a scatter plot of PC1 vs PC2, using point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids as black dots. Summarize how good the clustering results are compared to the real sub-type.
Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?
Now we have a patient x with breast cancer but with an unknown sub-type. We have this patient’s mRNA sequencing data. Project patient x onto the space of PC1 and PC2. (Hint: remember that we removed genes with no counts or no variability, took the log and centered the data; then find the patient’s PC1 to PC4 scores.) Plot this patient as a black dot in the plot from b) as well. Calculate the Euclidean distance between this patient and each cluster centroid. (Don’t forget that the clusters were obtained using 4 PCs.) Can you tell which sub-type this patient might have?
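The projection described in the hint can be sketched as follows. The data are synthetic, and the new observation `x_new` is a hypothetical patient assumed to have already been filtered and log-transformed in the same way as the training data.

```r
set.seed(1)
# Two well-separated synthetic groups stand in for the cleaned mRNA matrix.
x <- rbind(
  matrix(rnorm(30 * 20, mean = 0), ncol = 20),
  matrix(rnorm(30 * 20, mean = 4), ncol = 20)
)
pca <- prcomp(x, center = TRUE, scale. = FALSE)
km <- kmeans(pca$x[, 1:4], centers = 2, nstart = 20)

x_new <- rnorm(20, mean = 4)  # hypothetical new patient (same genes, same transform)

# Center with the training means, then project onto the first 4 loading vectors.
new_scores <- drop((x_new - pca$center) %*% pca$rotation[, 1:4])

# Euclidean distance from the new point to each centroid in the 4-PC space.
dists <- sqrt(rowSums(sweep(km$centers, 2, new_scores)^2))
nearest <- which.min(dists)
dists
nearest
```

The key point is that the new observation must be centered with the training column means (`pca$center`), not with its own mean, before multiplying by the loadings.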
What determines how fuel efficient a car is? Are Japanese cars more
fuel efficient? To answer these questions we will build various linear
models using the Auto dataset from the book
ISLR. The original dataset contains information for about
400 different cars built in various years. To get the data, first
install the package ISLR, which has been done in the first R-chunk. The
Auto dataset should be loaded automatically. The original data
source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg
Get familiar with this dataset first. Tip: you can use the command
?ISLR::Auto to view a description of the dataset. Our
response variable will be MPG: miles per gallon.
## mpg cylinders displacement horsepower weight
## Min. : 9.0 Min. :3.00 Min. : 68 Min. : 46.0 Min. :1613
## 1st Qu.:17.0 1st Qu.:4.00 1st Qu.:105 1st Qu.: 75.0 1st Qu.:2225
## Median :22.8 Median :4.00 Median :151 Median : 93.5 Median :2804
## Mean :23.4 Mean :5.47 Mean :194 Mean :104.5 Mean :2978
## 3rd Qu.:29.0 3rd Qu.:8.00 3rd Qu.:276 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.6 Max. :8.00 Max. :455 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.0 Min. :70 Min. :1.00 amc matador : 5
## 1st Qu.:13.8 1st Qu.:73 1st Qu.:1.00 ford pinto : 5
## Median :15.5 Median :76 Median :1.00 toyota corolla : 5
## Mean :15.5 Mean :76 Mean :1.58 amc gremlin : 4
## 3rd Qu.:17.0 3rd Qu.:79 3rd Qu.:2.00 amc hornet : 4
## Max. :24.8 Max. :82 Max. :3.00 chevrolet chevette: 4
## (Other) :365
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18.0 8 307.0 130 3504 12.0 70 1
## 2 15.0 8 350.0 165 3693 11.5 70 1
## 3 18.0 8 318.0 150 3436 11.0 70 1
## 4 16.0 8 304.0 150 3433 12.0 70 1
## 5 17.0 8 302.0 140 3449 10.5 70 1
## 6 15.0 8 429.0 198 4341 10.0 70 1
## 7 14.0 8 454.0 220 4354 9.0 70 1
## 8 14.0 8 440.0 215 4312 8.5 70 1
## 9 14.0 8 455.0 225 4425 10.0 70 1
## 10 15.0 8 390.0 190 3850 8.5 70 1
## 11 15.0 8 383.0 170 3563 10.0 70 1
## 12 14.0 8 340.0 160 3609 8.0 70 1
## 13 15.0 8 400.0 150 3761 9.5 70 1
## 14 14.0 8 455.0 225 3086 10.0 70 1
## 15 24.0 4 113.0 95 2372 15.0 70 3
## 16 22.0 6 198.0 95 2833 15.5 70 1
## 17 18.0 6 199.0 97 2774 15.5 70 1
## 18 21.0 6 200.0 85 2587 16.0 70 1
## 19 27.0 4 97.0 88 2130 14.5 70 3
## 20 26.0 4 97.0 46 1835 20.5 70 2
## 21 25.0 4 110.0 87 2672 17.5 70 2
## 22 24.0 4 107.0 90 2430 14.5 70 2
## 23 25.0 4 104.0 95 2375 17.5 70 2
## 24 26.0 4 121.0 113 2234 12.5 70 2
## 25 21.0 6 199.0 90 2648 15.0 70 1
## 26 10.0 8 360.0 215 4615 14.0 70 1
## 27 10.0 8 307.0 200 4376 15.0 70 1
## 28 11.0 8 318.0 210 4382 13.5 70 1
## 29 9.0 8 304.0 193 4732 18.5 70 1
## 30 27.0 4 97.0 88 2130 14.5 71 3
## 31 28.0 4 140.0 90 2264 15.5 71 1
## 32 25.0 4 113.0 95 2228 14.0 71 3
## 34 19.0 6 232.0 100 2634 13.0 71 1
## 35 16.0 6 225.0 105 3439 15.5 71 1
## 36 17.0 6 250.0 100 3329 15.5 71 1
## 37 19.0 6 250.0 88 3302 15.5 71 1
## 38 18.0 6 232.0 100 3288 15.5 71 1
## 39 14.0 8 350.0 165 4209 12.0 71 1
## 40 14.0 8 400.0 175 4464 11.5 71 1
## 41 14.0 8 351.0 153 4154 13.5 71 1
## 42 14.0 8 318.0 150 4096 13.0 71 1
## 43 12.0 8 383.0 180 4955 11.5 71 1
## 44 13.0 8 400.0 170 4746 12.0 71 1
## 45 13.0 8 400.0 175 5140 12.0 71 1
## 46 18.0 6 258.0 110 2962 13.5 71 1
## 47 22.0 4 140.0 72 2408 19.0 71 1
## 48 19.0 6 250.0 100 3282 15.0 71 1
## 49 18.0 6 250.0 88 3139 14.5 71 1
## 50 23.0 4 122.0 86 2220 14.0 71 1
## 51 28.0 4 116.0 90 2123 14.0 71 2
## 52 30.0 4 79.0 70 2074 19.5 71 2
## 53 30.0 4 88.0 76 2065 14.5 71 2
## 54 31.0 4 71.0 65 1773 19.0 71 3
## 55 35.0 4 72.0 69 1613 18.0 71 3
## 56 27.0 4 97.0 60 1834 19.0 71 2
## 57 26.0 4 91.0 70 1955 20.5 71 1
## 58 24.0 4 113.0 95 2278 15.5 72 3
## 59 25.0 4 97.5 80 2126 17.0 72 1
## 60 23.0 4 97.0 54 2254 23.5 72 2
## 61 20.0 4 140.0 90 2408 19.5 72 1
## 62 21.0 4 122.0 86 2226 16.5 72 1
## 63 13.0 8 350.0 165 4274 12.0 72 1
## 64 14.0 8 400.0 175 4385 12.0 72 1
## 65 15.0 8 318.0 150 4135 13.5 72 1
## 66 14.0 8 351.0 153 4129 13.0 72 1
## 67 17.0 8 304.0 150 3672 11.5 72 1
## 68 11.0 8 429.0 208 4633 11.0 72 1
## 69 13.0 8 350.0 155 4502 13.5 72 1
## 70 12.0 8 350.0 160 4456 13.5 72 1
## 71 13.0 8 400.0 190 4422 12.5 72 1
## 72 19.0 3 70.0 97 2330 13.5 72 3
## 73 15.0 8 304.0 150 3892 12.5 72 1
## 74 13.0 8 307.0 130 4098 14.0 72 1
## 75 13.0 8 302.0 140 4294 16.0 72 1
## 76 14.0 8 318.0 150 4077 14.0 72 1
## 77 18.0 4 121.0 112 2933 14.5 72 2
## 78 22.0 4 121.0 76 2511 18.0 72 2
## 79 21.0 4 120.0 87 2979 19.5 72 2
## 80 26.0 4 96.0 69 2189 18.0 72 2
## 81 22.0 4 122.0 86 2395 16.0 72 1
## 82 28.0 4 97.0 92 2288 17.0 72 3
## 83 23.0 4 120.0 97 2506 14.5 72 3
## 84 28.0 4 98.0 80 2164 15.0 72 1
## 85 27.0 4 97.0 88 2100 16.5 72 3
## 86 13.0 8 350.0 175 4100 13.0 73 1
## 87 14.0 8 304.0 150 3672 11.5 73 1
## 88 13.0 8 350.0 145 3988 13.0 73 1
## 89 14.0 8 302.0 137 4042 14.5 73 1
## 90 15.0 8 318.0 150 3777 12.5 73 1
## 91 12.0 8 429.0 198 4952 11.5 73 1
## 92 13.0 8 400.0 150 4464 12.0 73 1
## 93 13.0 8 351.0 158 4363 13.0 73 1
## 94 14.0 8 318.0 150 4237 14.5 73 1
## 95 13.0 8 440.0 215 4735 11.0 73 1
## 96 12.0 8 455.0 225 4951 11.0 73 1
## 97 13.0 8 360.0 175 3821 11.0 73 1
## 98 18.0 6 225.0 105 3121 16.5 73 1
## 99 16.0 6 250.0 100 3278 18.0 73 1
## 100 18.0 6 232.0 100 2945 16.0 73 1
## 101 18.0 6 250.0 88 3021 16.5 73 1
## 102 23.0 6 198.0 95 2904 16.0 73 1
## 103 26.0 4 97.0 46 1950 21.0 73 2
## 104 11.0 8 400.0 150 4997 14.0 73 1
## 105 12.0 8 400.0 167 4906 12.5 73 1
## 106 13.0 8 360.0 170 4654 13.0 73 1
## 107 12.0 8 350.0 180 4499 12.5 73 1
## 108 18.0 6 232.0 100 2789 15.0 73 1
## 109 20.0 4 97.0 88 2279 19.0 73 3
## 110 21.0 4 140.0 72 2401 19.5 73 1
## 111 22.0 4 108.0 94 2379 16.5 73 3
## 112 18.0 3 70.0 90 2124 13.5 73 3
## 113 19.0 4 122.0 85 2310 18.5 73 1
## 114 21.0 6 155.0 107 2472 14.0 73 1
## 115 26.0 4 98.0 90 2265 15.5 73 2
## 116 15.0 8 350.0 145 4082 13.0 73 1
## 117 16.0 8 400.0 230 4278 9.5 73 1
## 118 29.0 4 68.0 49 1867 19.5 73 2
## 119 24.0 4 116.0 75 2158 15.5 73 2
## 120 20.0 4 114.0 91 2582 14.0 73 2
## 121 19.0 4 121.0 112 2868 15.5 73 2
## 122 15.0 8 318.0 150 3399 11.0 73 1
## 123 24.0 4 121.0 110 2660 14.0 73 2
## 124 20.0 6 156.0 122 2807 13.5 73 3
## 125 11.0 8 350.0 180 3664 11.0 73 1
## 126 20.0 6 198.0 95 3102 16.5 74 1
## 128 19.0 6 232.0 100 2901 16.0 74 1
## 129 15.0 6 250.0 100 3336 17.0 74 1
## 130 31.0 4 79.0 67 1950 19.0 74 3
## 131 26.0 4 122.0 80 2451 16.5 74 1
## 132 32.0 4 71.0 65 1836 21.0 74 3
## 133 25.0 4 140.0 75 2542 17.0 74 1
## 134 16.0 6 250.0 100 3781 17.0 74 1
## 135 16.0 6 258.0 110 3632 18.0 74 1
## 136 18.0 6 225.0 105 3613 16.5 74 1
## 137 16.0 8 302.0 140 4141 14.0 74 1
## 138 13.0 8 350.0 150 4699 14.5 74 1
## 139 14.0 8 318.0 150 4457 13.5 74 1
## 140 14.0 8 302.0 140 4638 16.0 74 1
## 141 14.0 8 304.0 150 4257 15.5 74 1
## 142 29.0 4 98.0 83 2219 16.5 74 2
## 143 26.0 4 79.0 67 1963 15.5 74 2
## 144 26.0 4 97.0 78 2300 14.5 74 2
## 145 31.0 4 76.0 52 1649 16.5 74 3
## 146 32.0 4 83.0 61 2003 19.0 74 3
## 147 28.0 4 90.0 75 2125 14.5 74 1
## 148 24.0 4 90.0 75 2108 15.5 74 2
## 149 26.0 4 116.0 75 2246 14.0 74 2
## 150 24.0 4 120.0 97 2489 15.0 74 3
## 151 26.0 4 108.0 93 2391 15.5 74 3
## 152 31.0 4 79.0 67 2000 16.0 74 2
## 153 19.0 6 225.0 95 3264 16.0 75 1
## 154 18.0 6 250.0 105 3459 16.0 75 1
## 155 15.0 6 250.0 72 3432 21.0 75 1
## 156 15.0 6 250.0 72 3158 19.5 75 1
## 157 16.0 8 400.0 170 4668 11.5 75 1
## 158 15.0 8 350.0 145 4440 14.0 75 1
## 159 16.0 8 318.0 150 4498 14.5 75 1
## 160 14.0 8 351.0 148 4657 13.5 75 1
## 161 17.0 6 231.0 110 3907 21.0 75 1
## 162 16.0 6 250.0 105 3897 18.5 75 1
## 163 15.0 6 258.0 110 3730 19.0 75 1
## 164 18.0 6 225.0 95 3785 19.0 75 1
## 165 21.0 6 231.0 110 3039 15.0 75 1
## 166 20.0 8 262.0 110 3221 13.5 75 1
## 167 13.0 8 302.0 129 3169 12.0 75 1
## 168 29.0 4 97.0 75 2171 16.0 75 3
## 169 23.0 4 140.0 83 2639 17.0 75 1
## 170 20.0 6 232.0 100 2914 16.0 75 1
## 171 23.0 4 140.0 78 2592 18.5 75 1
## 172 24.0 4 134.0 96 2702 13.5 75 3
## 173 25.0 4 90.0 71 2223 16.5 75 2
## 174 24.0 4 119.0 97 2545 17.0 75 3
## 175 18.0 6 171.0 97 2984 14.5 75 1
## 176 29.0 4 90.0 70 1937 14.0 75 2
## 177 19.0 6 232.0 90 3211 17.0 75 1
## 178 23.0 4 115.0 95 2694 15.0 75 2
## 179 23.0 4 120.0 88 2957 17.0 75 2
## 180 22.0 4 121.0 98 2945 14.5 75 2
## 181 25.0 4 121.0 115 2671 13.5 75 2
## 182 33.0 4 91.0 53 1795 17.5 75 3
## 183 28.0 4 107.0 86 2464 15.5 76 2
## 184 25.0 4 116.0 81 2220 16.9 76 2
## 185 25.0 4 140.0 92 2572 14.9 76 1
## 186 26.0 4 98.0 79 2255 17.7 76 1
## 187 27.0 4 101.0 83 2202 15.3 76 2
## 188 17.5 8 305.0 140 4215 13.0 76 1
## 189 16.0 8 318.0 150 4190 13.0 76 1
## 190 15.5 8 304.0 120 3962 13.9 76 1
## 191 14.5 8 351.0 152 4215 12.8 76 1
## 192 22.0 6 225.0 100 3233 15.4 76 1
## 193 22.0 6 250.0 105 3353 14.5 76 1
## 194 24.0 6 200.0 81 3012 17.6 76 1
## 195 22.5 6 232.0 90 3085 17.6 76 1
## 196 29.0 4 85.0 52 2035 22.2 76 1
## 197 24.5 4 98.0 60 2164 22.1 76 1
## 198 29.0 4 90.0 70 1937 14.2 76 2
## 199 33.0 4 91.0 53 1795 17.4 76 3
## 200 20.0 6 225.0 100 3651 17.7 76 1
## 201 18.0 6 250.0 78 3574 21.0 76 1
## 202 18.5 6 250.0 110 3645 16.2 76 1
## 203 17.5 6 258.0 95 3193 17.8 76 1
## 204 29.5 4 97.0 71 1825 12.2 76 2
## 205 32.0 4 85.0 70 1990 17.0 76 3
## 206 28.0 4 97.0 75 2155 16.4 76 3
## 207 26.5 4 140.0 72 2565 13.6 76 1
## 208 20.0 4 130.0 102 3150 15.7 76 2
## 209 13.0 8 318.0 150 3940 13.2 76 1
## 210 19.0 4 120.0 88 3270 21.9 76 2
## 211 19.0 6 156.0 108 2930 15.5 76 3
## 212 16.5 6 168.0 120 3820 16.7 76 2
## 213 16.5 8 350.0 180 4380 12.1 76 1
## 214 13.0 8 350.0 145 4055 12.0 76 1
## 215 13.0 8 302.0 130 3870 15.0 76 1
## 216 13.0 8 318.0 150 3755 14.0 76 1
## 217 31.5 4 98.0 68 2045 18.5 77 3
## 218 30.0 4 111.0 80 2155 14.8 77 1
## 219 36.0 4 79.0 58 1825 18.6 77 2
## 220 25.5 4 122.0 96 2300 15.5 77 1
## 221 33.5 4 85.0 70 1945 16.8 77 3
## 222 17.5 8 305.0 145 3880 12.5 77 1
## 223 17.0 8 260.0 110 4060 19.0 77 1
## 224 15.5 8 318.0 145 4140 13.7 77 1
## 225 15.0 8 302.0 130 4295 14.9 77 1
## 226 17.5 6 250.0 110 3520 16.4 77 1
## 227 20.5 6 231.0 105 3425 16.9 77 1
## 228 19.0 6 225.0 100 3630 17.7 77 1
## 229 18.5 6 250.0 98 3525 19.0 77 1
## 230 16.0 8 400.0 180 4220 11.1 77 1
## 231 15.5 8 350.0 170 4165 11.4 77 1
## 232 15.5 8 400.0 190 4325 12.2 77 1
## 233 16.0 8 351.0 149 4335 14.5 77 1
## 234 29.0 4 97.0 78 1940 14.5 77 2
## 235 24.5 4 151.0 88 2740 16.0 77 1
## 236 26.0 4 97.0 75 2265 18.2 77 3
## 237 25.5 4 140.0 89 2755 15.8 77 1
## 238 30.5 4 98.0 63 2051 17.0 77 1
## 239 33.5 4 98.0 83 2075 15.9 77 1
## 240 30.0 4 97.0 67 1985 16.4 77 3
## 241 30.5 4 97.0 78 2190 14.1 77 2
## 242 22.0 6 146.0 97 2815 14.5 77 3
## 243 21.5 4 121.0 110 2600 12.8 77 2
## 244 21.5 3 80.0 110 2720 13.5 77 3
## 245 43.1 4 90.0 48 1985 21.5 78 2
## 246 36.1 4 98.0 66 1800 14.4 78 1
## 247 32.8 4 78.0 52 1985 19.4 78 3
## 248 39.4 4 85.0 70 2070 18.6 78 3
## 249 36.1 4 91.0 60 1800 16.4 78 3
## 250 19.9 8 260.0 110 3365 15.5 78 1
## 251 19.4 8 318.0 140 3735 13.2 78 1
## 252 20.2 8 302.0 139 3570 12.8 78 1
## 253 19.2 6 231.0 105 3535 19.2 78 1
## 254 20.5 6 200.0 95 3155 18.2 78 1
## 255 20.2 6 200.0 85 2965 15.8 78 1
## 256 25.1 4 140.0 88 2720 15.4 78 1
## 257 20.5 6 225.0 100 3430 17.2 78 1
## 258 19.4 6 232.0 90 3210 17.2 78 1
## 259 20.6 6 231.0 105 3380 15.8 78 1
## 260 20.8 6 200.0 85 3070 16.7 78 1
## 261 18.6 6 225.0 110 3620 18.7 78 1
## 262 18.1 6 258.0 120 3410 15.1 78 1
## 263 19.2 8 305.0 145 3425 13.2 78 1
## 264 17.7 6 231.0 165 3445 13.4 78 1
## 265 18.1 8 302.0 139 3205 11.2 78 1
## 266 17.5 8 318.0 140 4080 13.7 78 1
## 267 30.0 4 98.0 68 2155 16.5 78 1
## 268 27.5 4 134.0 95 2560 14.2 78 3
## 269 27.2 4 119.0 97 2300 14.7 78 3
## 270 30.9 4 105.0 75 2230 14.5 78 1
## 271 21.1 4 134.0 95 2515 14.8 78 3
## 272 23.2 4 156.0 105 2745 16.7 78 1
## 273 23.8 4 151.0 85 2855 17.6 78 1
## 274 23.9 4 119.0 97 2405 14.9 78 3
## 275 20.3 5 131.0 103 2830 15.9 78 2
## 276 17.0 6 163.0 125 3140 13.6 78 2
## 277 21.6 4 121.0 115 2795 15.7 78 2
## 278 16.2 6 163.0 133 3410 15.8 78 2
## 279 31.5 4 89.0 71 1990 14.9 78 2
## 280 29.5 4 98.0 68 2135 16.6 78 3
## 281 21.5 6 231.0 115 3245 15.4 79 1
## 282 19.8 6 200.0 85 2990 18.2 79 1
## 283 22.3 4 140.0 88 2890 17.3 79 1
## 284 20.2 6 232.0 90 3265 18.2 79 1
## 285 20.6 6 225.0 110 3360 16.6 79 1
## 286 17.0 8 305.0 130 3840 15.4 79 1
## 287 17.6 8 302.0 129 3725 13.4 79 1
## 288 16.5 8 351.0 138 3955 13.2 79 1
## 289 18.2 8 318.0 135 3830 15.2 79 1
## 290 16.9 8 350.0 155 4360 14.9 79 1
## 291 15.5 8 351.0 142 4054 14.3 79 1
## 292 19.2 8 267.0 125 3605 15.0 79 1
## 293 18.5 8 360.0 150 3940 13.0 79 1
## 294 31.9 4 89.0 71 1925 14.0 79 2
## 295 34.1 4 86.0 65 1975 15.2 79 3
## 296 35.7 4 98.0 80 1915 14.4 79 1
## 297 27.4 4 121.0 80 2670 15.0 79 1
## 298 25.4 5 183.0 77 3530 20.1 79 2
## 299 23.0 8 350.0 125 3900 17.4 79 1
## 300 27.2 4 141.0 71 3190 24.8 79 2
## 301 23.9 8 260.0 90 3420 22.2 79 1
## 302 34.2 4 105.0 70 2200 13.2 79 1
## 303 34.5 4 105.0 70 2150 14.9 79 1
## 304 31.8 4 85.0 65 2020 19.2 79 3
## 305 37.3 4 91.0 69 2130 14.7 79 2
## 306 28.4 4 151.0 90 2670 16.0 79 1
## 307 28.8 6 173.0 115 2595 11.3 79 1
## 308 26.8 6 173.0 115 2700 12.9 79 1
## 309 33.5 4 151.0 90 2556 13.2 79 1
## 310 41.5 4 98.0 76 2144 14.7 80 2
## 311 38.1 4 89.0 60 1968 18.8 80 3
## 312 32.1 4 98.0 70 2120 15.5 80 1
## 313 37.2 4 86.0 65 2019 16.4 80 3
## 314 28.0 4 151.0 90 2678 16.5 80 1
## 315 26.4 4 140.0 88 2870 18.1 80 1
## 316 24.3 4 151.0 90 3003 20.1 80 1
## 317 19.1 6 225.0 90 3381 18.7 80 1
## 318 34.3 4 97.0 78 2188 15.8 80 2
## 319 29.8 4 134.0 90 2711 15.5 80 3
## 320 31.3 4 120.0 75 2542 17.5 80 3
## 321 37.0 4 119.0 92 2434 15.0 80 3
## 322 32.2 4 108.0 75 2265 15.2 80 3
## 323 46.6 4 86.0 65 2110 17.9 80 3
## 324 27.9 4 156.0 105 2800 14.4 80 1
## 325 40.8 4 85.0 65 2110 19.2 80 3
## 326 44.3 4 90.0 48 2085 21.7 80 2
## 327 43.4 4 90.0 48 2335 23.7 80 2
## 328 36.4 5 121.0 67 2950 19.9 80 2
## 329 30.0 4 146.0 67 3250 21.8 80 2
## 330 44.6 4 91.0 67 1850 13.8 80 3
## 332 33.8 4 97.0 67 2145 18.0 80 3
## 333 29.8 4 89.0 62 1845 15.3 80 2
## 334 32.7 6 168.0 132 2910 11.4 80 3
## 335 23.7 3 70.0 100 2420 12.5 80 3
## 336 35.0 4 122.0 88 2500 15.1 80 2
## 338 32.4 4 107.0 72 2290 17.0 80 3
## 339 27.2 4 135.0 84 2490 15.7 81 1
## 340 26.6 4 151.0 84 2635 16.4 81 1
## 341 25.8 4 156.0 92 2620 14.4 81 1
## 342 23.5 6 173.0 110 2725 12.6 81 1
## 343 30.0 4 135.0 84 2385 12.9 81 1
## 344 39.1 4 79.0 58 1755 16.9 81 3
## 345 39.0 4 86.0 64 1875 16.4 81 1
## 346 35.1 4 81.0 60 1760 16.1 81 3
## 347 32.3 4 97.0 67 2065 17.8 81 3
## 348 37.0 4 85.0 65 1975 19.4 81 3
## 349 37.7 4 89.0 62 2050 17.3 81 3
## 350 34.1 4 91.0 68 1985 16.0 81 3
## 351 34.7 4 105.0 63 2215 14.9 81 1
## 352 34.4 4 98.0 65 2045 16.2 81 1
## 353 29.9 4 98.0 65 2380 20.7 81 1
## 354 33.0 4 105.0 74 2190 14.2 81 2
## 356 33.7 4 107.0 75 2210 14.4 81 3
## 357 32.4 4 108.0 75 2350 16.8 81 3
## 358 32.9 4 119.0 100 2615 14.8 81 3
## 359 31.6 4 120.0 74 2635 18.3 81 3
## 360 28.1 4 141.0 80 3230 20.4 81 2
## 361 30.7 6 145.0 76 3160 19.6 81 2
## 362 25.4 6 168.0 116 2900 12.6 81 3
## 363 24.2 6 146.0 120 2930 13.8 81 3
## 364 22.4 6 231.0 110 3415 15.8 81 1
## 365 26.6 8 350.0 105 3725 19.0 81 1
## 366 20.2 6 200.0 88 3060 17.1 81 1
## 367 17.6 6 225.0 85 3465 16.6 81 1
## 368 28.0 4 112.0 88 2605 19.6 82 1
## 369 27.0 4 112.0 88 2640 18.6 82 1
## 370 34.0 4 112.0 88 2395 18.0 82 1
## 371 31.0 4 112.0 85 2575 16.2 82 1
## 372 29.0 4 135.0 84 2525 16.0 82 1
## 373 27.0 4 151.0 90 2735 18.0 82 1
## 374 24.0 4 140.0 92 2865 16.4 82 1
## 375 36.0 4 105.0 74 1980 15.3 82 2
## 376 37.0 4 91.0 68 2025 18.2 82 3
## 377 31.0 4 91.0 68 1970 17.6 82 3
## 378 38.0 4 105.0 63 2125 14.7 82 1
## 379 36.0 4 98.0 70 2125 17.3 82 1
## 380 36.0 4 120.0 88 2160 14.5 82 3
## 381 36.0 4 107.0 75 2205 14.5 82 3
## 382 34.0 4 108.0 70 2245 16.9 82 3
## 383 38.0 4 91.0 67 1965 15.0 82 3
## 384 32.0 4 91.0 67 1965 15.7 82 3
## 385 38.0 4 91.0 67 1995 16.2 82 3
## 386 25.0 6 181.0 110 2945 16.4 82 1
## 387 38.0 6 262.0 85 3015 17.0 82 1
## 388 26.0 4 156.0 92 2585 14.5 82 1
## 389 22.0 6 232.0 112 2835 14.7 82 1
## 390 32.0 4 144.0 96 2665 13.9 82 3
## 391 36.0 4 135.0 84 2370 13.0 82 1
## 392 27.0 4 151.0 90 2950 17.3 82 1
## 393 27.0 4 140.0 86 2790 15.6 82 1
## 394 44.0 4 97.0 52 2130 24.6 82 2
## 395 32.0 4 135.0 84 2295 11.6 82 1
## 396 28.0 4 120.0 79 2625 18.6 82 1
## 397 31.0 4 119.0 82 2720 19.4 82 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## 7 chevrolet impala
## 8 plymouth fury iii
## 9 pontiac catalina
## 10 amc ambassador dpl
## 11 dodge challenger se
## 12 plymouth 'cuda 340
## 13 chevrolet monte carlo
## 14 buick estate wagon (sw)
## 15 toyota corona mark ii
## 16 plymouth duster
## 17 amc hornet
## 18 ford maverick
## 19 datsun pl510
## 20 volkswagen 1131 deluxe sedan
## 21 peugeot 504
## 22 audi 100 ls
## 23 saab 99e
## 24 bmw 2002
## 25 amc gremlin
## 26 ford f250
## 27 chevy c20
## 28 dodge d200
## 29 hi 1200d
## 30 datsun pl510
## 31 chevrolet vega 2300
## 32 toyota corona
## 34 amc gremlin
## 35 plymouth satellite custom
## 36 chevrolet chevelle malibu
## 37 ford torino 500
## 38 amc matador
## 39 chevrolet impala
## 40 pontiac catalina brougham
## 41 ford galaxie 500
## 42 plymouth fury iii
## 43 dodge monaco (sw)
## 44 ford country squire (sw)
## 45 pontiac safari (sw)
## 46 amc hornet sportabout (sw)
## 47 chevrolet vega (sw)
## 48 pontiac firebird
## 49 ford mustang
## 50 mercury capri 2000
## 51 opel 1900
## 52 peugeot 304
## 53 fiat 124b
## 54 toyota corolla 1200
## 55 datsun 1200
## 56 volkswagen model 111
## 57 plymouth cricket
## 58 toyota corona hardtop
## 59 dodge colt hardtop
## 60 volkswagen type 3
## 61 chevrolet vega
## 62 ford pinto runabout
## 63 chevrolet impala
## 64 pontiac catalina
## 65 plymouth fury iii
## 66 ford galaxie 500
## 67 amc ambassador sst
## 68 mercury marquis
## 69 buick lesabre custom
## 70 oldsmobile delta 88 royale
## 71 chrysler newport royal
## 72 mazda rx2 coupe
## 73 amc matador (sw)
## 74 chevrolet chevelle concours (sw)
## 75 ford gran torino (sw)
## 76 plymouth satellite custom (sw)
## 77 volvo 145e (sw)
## 78 volkswagen 411 (sw)
## 79 peugeot 504 (sw)
## 80 renault 12 (sw)
## 81 ford pinto (sw)
## 82 datsun 510 (sw)
## 83 toyouta corona mark ii (sw)
## 84 dodge colt (sw)
## 85 toyota corolla 1600 (sw)
## 86 buick century 350
## 87 amc matador
## 88 chevrolet malibu
## 89 ford gran torino
## 90 dodge coronet custom
## 91 mercury marquis brougham
## 92 chevrolet caprice classic
## 93 ford ltd
## 94 plymouth fury gran sedan
## 95 chrysler new yorker brougham
## 96 buick electra 225 custom
## 97 amc ambassador brougham
## 98 plymouth valiant
## 99 chevrolet nova custom
## 100 amc hornet
## 101 ford maverick
## 102 plymouth duster
## 103 volkswagen super beetle
## 104 chevrolet impala
## 105 ford country
## 106 plymouth custom suburb
## 107 oldsmobile vista cruiser
## 108 amc gremlin
## 109 toyota carina
## 110 chevrolet vega
## 111 datsun 610
## 112 maxda rx3
## 113 ford pinto
## 114 mercury capri v6
## 115 fiat 124 sport coupe
## 116 chevrolet monte carlo s
## 117 pontiac grand prix
## 118 fiat 128
## 119 opel manta
## 120 audi 100ls
## 121 volvo 144ea
## 122 dodge dart custom
## 123 saab 99le
## 124 toyota mark ii
## 125 oldsmobile omega
## 126 plymouth duster
## 128 amc hornet
## 129 chevrolet nova
## 130 datsun b210
## 131 ford pinto
## 132 toyota corolla 1200
## 133 chevrolet vega
## 134 chevrolet chevelle malibu classic
## 135 amc matador
## 136 plymouth satellite sebring
## 137 ford gran torino
## 138 buick century luxus (sw)
## 139 dodge coronet custom (sw)
## 140 ford gran torino (sw)
## 141 amc matador (sw)
## 142 audi fox
## 143 volkswagen dasher
## 144 opel manta
## 145 toyota corona
## 146 datsun 710
## 147 dodge colt
## 148 fiat 128
## 149 fiat 124 tc
## 150 honda civic
## 151 subaru
## 152 fiat x1.9
## 153 plymouth valiant custom
## 154 chevrolet nova
## 155 mercury monarch
## 156 ford maverick
## 157 pontiac catalina
## 158 chevrolet bel air
## 159 plymouth grand fury
## 160 ford ltd
## 161 buick century
## 162 chevroelt chevelle malibu
## 163 amc matador
## 164 plymouth fury
## 165 buick skyhawk
## 166 chevrolet monza 2+2
## 167 ford mustang ii
## 168 toyota corolla
## 169 ford pinto
## 170 amc gremlin
## 171 pontiac astro
## 172 toyota corona
## 173 volkswagen dasher
## 174 datsun 710
## 175 ford pinto
## 176 volkswagen rabbit
## 177 amc pacer
## 178 audi 100ls
## 179 peugeot 504
## 180 volvo 244dl
## 181 saab 99le
## 182 honda civic cvcc
## 183 fiat 131
## 184 opel 1900
## 185 capri ii
## 186 dodge colt
## 187 renault 12tl
## 188 chevrolet chevelle malibu classic
## 189 dodge coronet brougham
## 190 amc matador
## 191 ford gran torino
## 192 plymouth valiant
## 193 chevrolet nova
## 194 ford maverick
## 195 amc hornet
## 196 chevrolet chevette
## 197 chevrolet woody
## 198 vw rabbit
## 199 honda civic
## 200 dodge aspen se
## 201 ford granada ghia
## 202 pontiac ventura sj
## 203 amc pacer d/l
## 204 volkswagen rabbit
## 205 datsun b-210
## 206 toyota corolla
## 207 ford pinto
## 208 volvo 245
## 209 plymouth volare premier v8
## 210 peugeot 504
## 211 toyota mark ii
## 212 mercedes-benz 280s
## 213 cadillac seville
## 214 chevy c10
## 215 ford f108
## 216 dodge d100
## 217 honda accord cvcc
## 218 buick opel isuzu deluxe
## 219 renault 5 gtl
## 220 plymouth arrow gs
## 221 datsun f-10 hatchback
## 222 chevrolet caprice classic
## 223 oldsmobile cutlass supreme
## 224 dodge monaco brougham
## 225 mercury cougar brougham
## 226 chevrolet concours
## 227 buick skylark
## 228 plymouth volare custom
## 229 ford granada
## 230 pontiac grand prix lj
## 231 chevrolet monte carlo landau
## 232 chrysler cordoba
## 233 ford thunderbird
## 234 volkswagen rabbit custom
## 235 pontiac sunbird coupe
## 236 toyota corolla liftback
## 237 ford mustang ii 2+2
## 238 chevrolet chevette
## 239 dodge colt m/m
## 240 subaru dl
## 241 volkswagen dasher
## 242 datsun 810
## 243 bmw 320i
## 244 mazda rx-4
## 245 volkswagen rabbit custom diesel
## 246 ford fiesta
## 247 mazda glc deluxe
## 248 datsun b210 gx
## 249 honda civic cvcc
## 250 oldsmobile cutlass salon brougham
## 251 dodge diplomat
## 252 mercury monarch ghia
## 253 pontiac phoenix lj
## 254 chevrolet malibu
## 255 ford fairmont (auto)
## 256 ford fairmont (man)
## 257 plymouth volare
## 258 amc concord
## 259 buick century special
## 260 mercury zephyr
## 261 dodge aspen
## 262 amc concord d/l
## 263 chevrolet monte carlo landau
## 264 buick regal sport coupe (turbo)
## 265 ford futura
## 266 dodge magnum xe
## 267 chevrolet chevette
## 268 toyota corona
## 269 datsun 510
## 270 dodge omni
## 271 toyota celica gt liftback
## 272 plymouth sapporo
## 273 oldsmobile starfire sx
## 274 datsun 200-sx
## 275 audi 5000
## 276 volvo 264gl
## 277 saab 99gle
## 278 peugeot 604sl
## 279 volkswagen scirocco
## 280 honda accord lx
## 281 pontiac lemans v6
## 282 mercury zephyr 6
## 283 ford fairmont 4
## 284 amc concord dl 6
## 285 dodge aspen 6
## 286 chevrolet caprice classic
## 287 ford ltd landau
## 288 mercury grand marquis
## 289 dodge st. regis
## 290 buick estate wagon (sw)
## 291 ford country squire (sw)
## 292 chevrolet malibu classic (sw)
## 293 chrysler lebaron town @ country (sw)
## 294 vw rabbit custom
## 295 maxda glc deluxe
## 296 dodge colt hatchback custom
## 297 amc spirit dl
## 298 mercedes benz 300d
## 299 cadillac eldorado
## 300 peugeot 504
## 301 oldsmobile cutlass salon brougham
## 302 plymouth horizon
## 303 plymouth horizon tc3
## 304 datsun 210
## 305 fiat strada custom
## 306 buick skylark limited
## 307 chevrolet citation
## 308 oldsmobile omega brougham
## 309 pontiac phoenix
## 310 vw rabbit
## 311 toyota corolla tercel
## 312 chevrolet chevette
## 313 datsun 310
## 314 chevrolet citation
## 315 ford fairmont
## 316 amc concord
## 317 dodge aspen
## 318 audi 4000
## 319 toyota corona liftback
## 320 mazda 626
## 321 datsun 510 hatchback
## 322 toyota corolla
## 323 mazda glc
## 324 dodge colt
## 325 datsun 210
## 326 vw rabbit c (diesel)
## 327 vw dasher (diesel)
## 328 audi 5000s (diesel)
## 329 mercedes-benz 240d
## 330 honda civic 1500 gl
## 332 subaru dl
## 333 vokswagen rabbit
## 334 datsun 280-zx
## 335 mazda rx-7 gs
## 336 triumph tr7 coupe
## 338 honda accord
## 339 plymouth reliant
## 340 buick skylark
## 341 dodge aries wagon (sw)
## 342 chevrolet citation
## 343 plymouth reliant
## 344 toyota starlet
## 345 plymouth champ
## 346 honda civic 1300
## 347 subaru
## 348 datsun 210 mpg
## 349 toyota tercel
## 350 mazda glc 4
## 351 plymouth horizon 4
## 352 ford escort 4w
## 353 ford escort 2h
## 354 volkswagen jetta
## 356 honda prelude
## 357 toyota corolla
## 358 datsun 200sx
## 359 mazda 626
## 360 peugeot 505s turbo diesel
## 361 volvo diesel
## 362 toyota cressida
## 363 datsun 810 maxima
## 364 buick century
## 365 oldsmobile cutlass ls
## 366 ford granada gl
## 367 chrysler lebaron salon
## 368 chevrolet cavalier
## 369 chevrolet cavalier wagon
## 370 chevrolet cavalier 2-door
## 371 pontiac j2000 se hatchback
## 372 dodge aries se
## 373 pontiac phoenix
## 374 ford fairmont futura
## 375 volkswagen rabbit l
## 376 mazda glc custom l
## 377 mazda glc custom
## 378 plymouth horizon miser
## 379 mercury lynx l
## 380 nissan stanza xe
## 381 honda accord
## 382 toyota corolla
## 383 honda civic
## 384 honda civic (auto)
## 385 datsun 310 gx
## 386 buick century limited
## 387 oldsmobile cutlass ciera (diesel)
## 388 chrysler lebaron medallion
## 389 ford granada l
## 390 toyota celica gt
## 391 dodge charger 2.2
## 392 chevrolet camaro
## 393 ford mustang gl
## 394 vw pickup
## 395 dodge rampage
## 396 ford ranger
## 397 chevy s-10
origin should be set as a factor.
Explaining the Variables:
- mpg: Miles per gallon, a measure of fuel economy: how many miles the car can travel on one gallon of gas.
- cylinders: The number of cylinders in the car’s engine, ranging from 3 to 8.
- displacement: The engine displacement in cubic inches.
- horsepower: The power of the engine in horsepower.
- weight: The weight of the vehicle in lbs.
- acceleration: Time to accelerate from 0 to 60 mph in seconds.
- year: The model year of the vehicle.
- origin: The car’s country of origin.
- name: Brand and model name of the vehicle.
All variables except origin and name are
numeric. origin is a categorical factor that represents the
country of manufacture.
## [1] 392
392 cars are included in this data set.
## mean_mpg sd_mpg min_mpg max_mpg
## 1 23.4 7.81 9 46.6
The average mpg was 23.4. The least fuel-efficient car achieved only 9 mpg, while the most fuel-efficient car achieved 46.6 mpg. There is moderate variation in mpg, with a standard deviation of 7.81 mpg.
## # A tibble: 3 × 3
## origin mean_mpg sd_mpg
## <fct> <dbl> <dbl>
## 1 USA 20.0 6.44
## 2 Europe 27.6 6.58
## 3 Japan 30.5 6.09
Grouping the vehicles by country of origin, the mean and standard deviation of fuel economy show that Japanese cars have the highest average mpg at 30.5, followed by European cars (27.6) and then American cars (20.0).
The fuel economy of Japanese cars was also the most consistent, with a standard deviation of 6.09 mpg, compared to 6.58 and 6.44 mpg for European and American cars respectively.
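Because the three groups have quite different means, one way to make the "most consistent" comparison more precise is to look at spread relative to the mean. The sketch below is only an illustration: it uses the rounded group summaries from the table above rather than the raw data, and the names `mean_mpg`, `sd_mpg` and `cv` are chosen here for the example.

```r
# Group summaries copied from the table above (rounded values).
origin   <- c("USA", "Europe", "Japan")
mean_mpg <- c(20.0, 27.6, 30.5)
sd_mpg   <- c(6.44, 6.58, 6.09)

# Coefficient of variation: standard deviation relative to the group's own mean.
cv <- sd_mpg / mean_mpg
names(cv) <- origin
round(cv, 3)

# Group with the smallest relative spread.
names(cv)[which.min(cv)]
```

On these rounded figures, Japanese cars also have the smallest relative spread, which agrees with the comparison of raw standard deviations.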
Relationship between Displacement and MPG
The scatterplot above shows that as the engine’s displacement increases, the MPG decreases.
Relationship between Weight and MPG
Relationship between Acceleration and MPG
time have on MPG?
mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
## 2.5 % 97.5 %
## (Intercept) -2746.46 -2067.7
## year 1.06 1.4
##
## Call:
## lm(formula = mpg ~ year, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.021 -5.441 -0.441 4.974 18.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.41e+03 1.73e+02 -13.9 <2e-16 ***
## year 1.23e+00 8.74e-02 14.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.335
## F-statistic: 198 on 1 and 390 DF, p-value: <2e-16
The p-value for year in this linear model is <2e-16. Hence, we can conclude that year is a significant variable at the .05 level. The regression line suggests that each one-year increase in model year is associated with an increase of about 1.23 mpg on average.
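As a check on how the confidence interval for year relates to the summary output, the interval can be rebuilt from the estimate and its standard error. This is a minimal sketch using the rounded values printed above, so it should agree with the reported interval only up to rounding; the variable names are illustrative.

```r
# Estimate and standard error for `year`, copied from the summary output above.
est <- 1.23
se  <- 0.0874
df  <- 390  # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 2)
```

The same construction applies to any coefficient in the later models, with the residual degrees of freedom taken from the corresponding summary.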
horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year’s effect found here.
As the horsepower output of the vehicle increases, the mpg decreases.
## 2.5 % 97.5 %
## (Intercept) -1519.516 -1003.580
## year 0.527 0.788
## horsepower -0.144 -0.119
##
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.077 -3.078 -0.431 2.588 15.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.26e+03 1.31e+02 -9.61 <2e-16 ***
## year 6.57e-01 6.63e-02 9.92 <2e-16 ***
## horsepower -1.32e-01 6.34e-03 -20.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared: 0.685, Adjusted R-squared: 0.684
## F-statistic: 424 on 2 and 389 DF, p-value: <2e-16
In the multiple linear regression model, holding horsepower constant, a one-year increase in model year is associated with an average increase of approximately 0.657 mpg. In the simple regression, a one-year increase in model year was associated with an average increase of 1.23 mpg, so the estimated year effect almost halves once horsepower is controlled for.
Holding year constant, a one-unit increase in horsepower is associated with a decrease of about 0.132 mpg on average. The p-values for both year and horsepower are < 2e-16, so both are statistically significant at the 0.05 level; in particular, year remains significant.
The residual standard error suggests that the typical deviation of observed mpg values from the fitted regression surface is about 4.39 mpg.
For the simple regression, the 95% CI was \[(1.06, 1.40)\]. This means that each additional year is associated with an increase of between 1.06 and 1.40 mpg when year is the only predictor.
For the multiple linear regression, the 95% CI was \[(0.527, 0.788)\], which means that each additional year is associated with an increase of between 0.527 and 0.788 mpg (about 0.66 mpg at the point estimate) after adjusting for horsepower.
## [1] -0.416
The correlation between year and horsepower is -0.416, a moderate negative correlation: as year increases, horsepower tends to decrease on average. This matches the history, as vehicles from the early 1970s had bigger engines with higher horsepower. After the energy crisis in the 1970s and 1980s, engines became much smaller, which in turn reduced the horsepower output of these vehicles.
This explains the difference between the two intervals. In the simple regression, year also picks up the fuel-economy gains that came from falling horsepower. Once horsepower is held constant, only the improvement among cars of equal horsepower remains, so the estimated year effect and its confidence interval shift lower.
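The gap between the two year slopes can be made concrete with the omitted-variable identity for least squares: the simple slope equals the adjusted slope plus the horsepower coefficient times \(\delta\), the slope from regressing horsepower on year. The sketch below backs \(\delta\) out of the rounded coefficients reported above; it is an implied value under that identity, not an estimate fitted here, and the variable names are illustrative.

```r
b_year_simple <- 1.23    # slope of year in mpg ~ year
b_year_multi  <- 0.657   # slope of year in mpg ~ year + horsepower
b_hp_multi    <- -0.132  # slope of horsepower in mpg ~ year + horsepower

# Omitted-variable identity for least squares:
#   b_year_simple = b_year_multi + b_hp_multi * delta,
# where delta is the slope from regressing horsepower on year.
delta    <- (b_year_simple - b_year_multi) / b_hp_multi
indirect <- b_hp_multi * delta  # part of the simple slope that runs through horsepower

c(delta = delta, indirect = indirect)
```

Under these rounded values, horsepower falls by roughly 4 units per model year, and that decline accounts for a bit under half of the 1.23 mpg per year seen in the simple regression.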
lm(mpg ~ year * horsepower). Is the interaction effect significant at .05 level? Explain the year effect (if any).
##
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.349 -2.451 -0.456 2.406 14.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.29e+03 3.19e+02 -13.5 <2e-16 ***
## year 2.19e+00 1.61e-01 13.6 <2e-16 ***
## horsepower 3.14e+01 3.08e+00 10.2 <2e-16 ***
## year:horsepower -1.60e-02 1.56e-03 -10.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.75
## F-statistic: 393 on 3 and 388 DF, p-value: <2e-16
The interaction between year and horsepower is statistically significant at the 0.05 level, as its p-value is <2e-16. The fitted model is effectively \[ mpg = \beta_0+\beta_1\times year+\beta_2\times horsepower+\beta_3(year\times horsepower)\] so a one-year increase changes the expected mpg by \(\beta_1+\beta_3\times horsepower\). From this we can conclude that the effect of year now also depends on horsepower.
The negative interaction coefficient of -0.016 for year:horsepower means that the yearly improvement in mpg, estimated as \(2.19-0.016\times horsepower\), shrinks as horsepower increases.
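To see what the interaction implies in practice, the estimated year effect can be evaluated at a few horsepower values. This is a rough sketch based on the rounded coefficients in the summary above; the horsepower values and the helper `year_effect` are chosen here purely for illustration.

```r
b_year <- 2.19     # coefficient of year
b_int  <- -0.0160  # coefficient of year:horsepower

# Expected change in mpg per additional model year at a given horsepower.
year_effect <- function(horsepower) b_year + b_int * horsepower

hp <- c(50, 100, 150)  # illustrative horsepower values
data.frame(horsepower = hp, year_effect = year_effect(hp))

# Horsepower at which the estimated year effect reaches zero.
-b_year / b_int
```

With these rounded estimates, low-horsepower cars gain well over 1 mpg per model year, while the estimated gain vanishes at around 137 horsepower and turns slightly negative beyond that.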
Remember that the same variable can play different roles! Take a
quick look at the variable cylinders, and try to use this
variable in the following analyses wisely. We all agree that a larger
number of cylinders will lower mpg. However, we can interpret
cylinders as either a continuous (numeric) variable or a
categorical variable.
cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
##
## Call:
## lm(formula = mpg ~ cylinders, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.241 -3.183 -0.633 2.549 17.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.916 0.835 51.4 <2e-16 ***
## cylinders -3.558 0.146 -24.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 597 on 1 and 390 DF, p-value: <2e-16
## 2.5 % 97.5 %
## (Intercept) 41.27 44.56
## cylinders -3.84 -3.27
## [1] TRUE
## cylinders
## -3.56
The p-value for cylinders is <2e-16, hence cylinders is statistically significant at the 0.01 level.
The slope estimate is -3.558, indicating that each additional cylinder is associated with a decrease of approximately 3.558 in expected mpg.
The 95% confidence interval for the slope is \[(-3.84, -3.27)\].
The entire interval is negative, confirming a strong negative relationship between cylinders and mpg, further highlighted in the plot below.
## `geom_smooth()` using formula = 'y ~ x'
cylinders as a categorical/factor. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
##
## Call:
## lm(formula = mpg ~ cylinders_f, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.284 -2.904 -0.963 2.344 18.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.550 2.349 8.75 < 2e-16 ***
## cylinders_f4 8.734 2.373 3.68 0.00027 ***
## cylinders_f5 6.817 3.589 1.90 0.05825 .
## cylinders_f6 -0.577 2.405 -0.24 0.81071
## cylinders_f8 -5.587 2.395 -2.33 0.02015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared: 0.641, Adjusted R-squared: 0.638
## F-statistic: 173 on 4 and 387 DF, p-value: <2e-16
## 2.5 % 97.5 %
## (Intercept) 15.931 25.169
## cylinders_f4 4.069 13.399
## cylinders_f5 -0.239 13.873
## cylinders_f6 -5.306 4.153
## cylinders_f8 -10.295 -0.879
## Anova Table (Type II tests)
##
## Response: mpg
## Sum Sq Df F value Pr(>F)
## cylinders_f 15275 4 173 <2e-16 ***
## Residuals 8544 387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 5 × 4
## cylinders_f n mean_mpg sd_mpg
## <fct> <int> <dbl> <dbl>
## 1 3 4 20.6 2.56
## 2 4 199 29.3 5.67
## 3 5 3 27.4 8.23
## 4 6 83 20.0 3.83
## 5 8 103 15.0 2.84
With cylinders treated as a categorical variable, the overall F-test for cylinders still has a p-value of <2e-16, so cylinders is significant at the 0.01 level.
Looking at the table of means and standard deviations for the different cylinder counts, we can see that 4 cylinder vehicles have the highest mean mpg and that fuel economy decreases as the number of cylinders increases from there. The mean mpg for 3 cylinder vehicles is fairly low, which goes against the trend identified in part a). However, there are only 4 observations with 3 cylinders, so this estimate is unstable and may not be representative of the true fuel economy of these vehicles. The same is true for 5 cylinder vehicles, for which there were only 3 observations.
Observing the box plot, we can see that there are a lot of outliers among 6 cylinder vehicles, with one vehicle achieving greater mpg than the most fuel-efficient 5 cylinder vehicle. 4 cylinder vehicles have the greatest range in mpg values, but also have the largest number of observations.
cylinders as a continuous and categorical variable in your models?
The numerical model assumes that \[ E[mpg|cylinders]=\beta_0+\beta_1\times cylinders \]
This forces a linear trend: each additional cylinder changes the expected mpg by the same constant amount \(\beta_1\).
Meanwhile, the factor model assumes that \[E[mpg|cylinders=k]=\mu_k\]
This does not assume linearity, allowing each cylinder category to have its own mean.
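The difference between the two specifications shows up when their fitted values are placed side by side. The sketch below recomputes both sets of fitted means from the coefficients printed in the two summaries above (rounded as shown there); the vector and column names are illustrative.

```r
cyl <- c(3, 4, 5, 6, 8)

# Numeric model: one intercept and one slope shared by all cylinder counts.
linear_fit <- 42.916 - 3.558 * cyl

# Factor model: baseline mean (3 cylinders) plus an offset for each other level.
factor_fit <- 20.550 + c(0, 8.734, 6.817, -0.577, -5.587)

comparison <- data.frame(cylinders = cyl,
                         linear    = linear_fit,
                         factor    = factor_fit,
                         gap       = linear_fit - factor_fit)
comparison
```

The factor model's fitted values are simply the group means from the summary table, while the straight line is pulled toward the heavily populated 4, 6 and 8 cylinder groups and misses the sparse 3 cylinder group by a wide margin.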
mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable at .01 level?
## Analysis of Variance Table
##
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ cylinders_f
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9416
## 2 387 8544 3 871 13.2 3.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova was used to compare the linear and factor models. From the ANOVA table, we see that \[F=13.2,\qquad p=3.4\times 10^{-8}\]
Since \[3.4\times 10^{-8}<0.01\], we reject at the 0.01 level the null hypothesis that the relationship between mpg and cylinders is linear.
Hence, there is strong evidence that the relationship between mpg and cylinders is not purely linear, and the categorical model provides a significantly better fit than the linear model. Treating cylinders as a factor variable is therefore more appropriate.
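The F statistic in the table can be reproduced by hand from the two residual sums of squares, which makes the logic of the nested-model comparison explicit. This is a minimal sketch using the rounded RSS values printed above, so the result matches the table only up to rounding; the variable names are illustrative.

```r
rss_linear <- 9416; df_linear <- 390  # mpg ~ cylinders
rss_factor <- 8544; df_factor <- 387  # mpg ~ cylinders_f

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
extra_params <- df_linear - df_factor
f_stat  <- ((rss_linear - rss_factor) / extra_params) / (rss_factor / df_factor)
p_value <- pf(f_stat, extra_params, df_factor, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The comparison is valid because the linear model is a special case of the factor model: forcing the five group means to lie on a straight line removes three free parameters.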
Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.
## displacement weight horsepower
## displacement 1.000 0.933 0.897
## weight 0.933 1.000 0.865
## horsepower 0.897 0.865 1.000
We can see that displacement is highly correlated with both weight and horsepower, so keeping all three variables is unnecessary and would inflate standard errors and reduce interpretability. Hence, displacement will be excluded from the final model.
The final model is \[mpg \sim year+weight+horsepower+cylinders+origin\], with cylinders treated as a factor.
The line shown in the residuals vs fitted plot shows a slight curvature, suggesting minor non-linearity. There is also a small increase in the spread of the residuals at higher fitted values, indicating slight heteroskedasticity. However, in the Q-Q plot the vast majority of the points lie along the line, suggesting that the residuals are approximately normal and that the model provides a strong and appropriate fit to the data.
##
## Call:
## lm(formula = mpg ~ year + weight + horsepower + cylinders_f +
## origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.442 -1.953 -0.063 1.563 12.786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.39e+03 9.64e+01 -14.46 < 2e-16 ***
## year 7.22e-01 4.88e-02 14.81 < 2e-16 ***
## weight -5.10e-03 5.06e-04 -10.07 < 2e-16 ***
## horsepower -2.54e-02 9.79e-03 -2.60 0.00977 **
## cylinders_f4 7.67e+00 1.61e+00 4.76 2.7e-06 ***
## cylinders_f5 8.39e+00 2.47e+00 3.39 0.00077 ***
## cylinders_f6 5.24e+00 1.68e+00 3.12 0.00192 **
## cylinders_f8 8.03e+00 1.79e+00 4.49 9.6e-06 ***
## originEurope 1.28e+00 5.22e-01 2.45 0.01455 *
## originJapan 2.21e+00 5.06e-01 4.38 1.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.12 on 382 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.841
## F-statistic: 230 on 9 and 382 DF, p-value: <2e-16
Year and weight both have p-values below 2e-16. Holding the other predictors fixed, each additional model year is associated with an average increase of 0.722 mpg, while each additional pound of weight is associated with an average decrease of 0.0051 mpg. Each additional horsepower is associated with a decrease of about 0.0254 mpg; its p-value of 0.00977 is just under the 0.01 threshold, so horsepower is less strongly significant than year and weight.
Treating cylinders as a categorical variable confirms that engine configuration affects MPG in a non-linear manner. Additionally, vehicles from Europe and Japan exhibit higher MPG (1.28 and 2.21 mpg respectively) relative to U.S. vehicles after controlling for the other variables in the model.
Overall, fuel efficiency is strongly influenced by vehicle size, engine output, year, and country of origin.
mpg of the following car: A red car built
in the US in 1983 that is 180 inches long, has eight cylinders,
displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of
260. Also give a 95% CI for your prediction.
Since colour, length and displacement are not included in the model, they are not used in the prediction.
## fit lwr upr
## 1 19.5 12.9 26.1
Using the model described above, the predicted mpg of the vehicle is 19.5. The 95% interval for this prediction is \[(12.9, 26.1)\].
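The `fit lwr upr` layout above is what `predict()` returns for an `lm` object when an interval is requested; which interval type was used is not shown in this section. The sketch below contrasts the two options, using built-in `mtcars` and a two-predictor model as stand-ins (both are assumptions for illustration, not the model fitted above).

```r
# Stand-in model on built-in mtcars (assumption; not the report's model).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new car: 4000 lb (mtcars stores wt in units of 1000 lb), 260 hp.
new_car <- data.frame(wt = 4.0, hp = 260)

# Interval for the MEAN mpg of all cars with these characteristics.
ci_mean <- predict(fit_demo, newdata = new_car,
                   interval = "confidence", level = 0.95)

# Interval for the mpg of ONE new car; adds the residual variance.
pi_new <- predict(fit_demo, newdata = new_car,
                  interval = "prediction", level = 0.95)

print(ci_mean)
print(pi_new)
```

The prediction interval is always the wider of the two, because it accounts for the scatter of an individual car around the mean in addition to the uncertainty in the estimated coefficients. For a question about a single car, it is usually the more appropriate choice.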
This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.
Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).
Create a corresponding output vector for \(\mathbf{y}\) according to the equation
given above. Use set.seed(1). Then, create a scatterplot
with \((x_i, y_i)\) pairs. Base R
plotting is acceptable, but if you can, please attempt to use
ggplot2 to create the plot. Make sure to have clear labels
and sensible titles on your plots.
lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates
look to be good?
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.662 -0.880 0.014 1.247 2.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.331 0.557 2.39 0.022 *
## x 0.906 0.959 0.95 0.350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared: 0.023, Adjusted R-squared: -0.00272
## F-statistic: 0.894 on 1 and 38 DF, p-value: 0.35
## (Intercept)
## 1.33
## x
## 0.906
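The true values for \(\beta_0\) and \(\beta_1\) are 1 and 1.2 respectively.
The estimated values are \(\hat\beta_0=1.33\) and \(\hat\beta_1=0.906\). Both are within one standard error of the true values (standard errors 0.557 and 0.959), so the estimates are reasonably close given the noise level.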
## [1] 1.79
The residual standard error of 1.79 is fairly close to the true value \(\sigma = 2\).
## 2.5 % 97.5 %
## -1.03 2.85
We have a 95% confidence interval of \((-1.03, 2.85)\), which includes the true value \(\boldsymbol{\beta}_1=1.2\).
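The steps so far (simulate, fit, extract the estimates, the residual standard error and the slope interval) can be sketched as follows. How \(\mathbf{x}\) was generated is not shown in this section, so the sketch assumes 40 equally spaced points on \([0, 1]\); whether it reproduces the numbers above depends on that assumption.

```r
set.seed(1)
n <- 40
x <- seq(0, 1, length.out = n)      # ASSUMED design; not stated in this section
eps <- rnorm(n, mean = 0, sd = 2)   # normal errors with sigma = 2
y <- 1 + 1.2 * x + eps              # true model: beta0 = 1, beta1 = 1.2
df <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = df)

beta_hat  <- coef(fit)                          # least squares estimates
sigma_hat <- sigma(fit)                         # residual standard error
ci_slope  <- confint(fit, "x", level = 0.95)    # 95% CI for the slope

# Does the interval contain the true slope?
covers <- ci_slope[1, 1] <= 1.2 && 1.2 <= ci_slope[1, 2]

print(beta_hat)
print(sigma_hat)
print(ci_slope)
print(covers)
```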
## `geom_smooth()` using formula = 'y ~ x'
The black line is the true mean function. The blue line is the least squares fitted line. The two lines are fairly close, with minor deviations at lower x values.
The residuals vs fitted plot shows no systematic pattern or curvature. The residuals are randomly scattered around zero, with approximately constant spread, suggesting that the linearity and homoscedasticity assumptions are satisfied.
From the Q-Q plot we can see that several points fall below the reference line in the lower tail. Some deviation of this kind is expected with a sample size as small as 40, and the remaining points lie close to the reference line.
Linear model assumptions are well satisfied in this sample.
This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.
Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. Also construct 95% confidence intervals for the slope.
## [1] 1.04
## [1] 1.1
## [1] 1.07
Under a simple linear model \[y_i=\beta_0+\beta_1x_i+\varepsilon_i,\qquad \varepsilon_i\sim N(0,\sigma^2)\] with fixed \(x_1,\dots,x_n\), the least squares slope satisfies \[\hat\beta_1\sim N\bigg(\beta_1,\frac{\sigma^2}{S_{xx}}\bigg),\quad \text{where}\ \ S_{xx}=\sum_{i=1}^n(x_i-\bar x)^2.\] In this simulation, we have \(\beta_1=1.2\) and \(\sigma=2\), so
\[\hat{\beta}_1 \sim N\!\left(1.2,\; \frac{4}{S_{xx}}\right), \quad \text{and} \quad \mathrm{SD}(\hat{\beta}_1) = \frac{2}{\sqrt{S_{xx}}}.\] The simulated standard deviation of 1.1 is close to the theoretical value of 1.07, so the variability matches the theory well. The simulated mean of the slope, 1.04, is somewhat below the true slope of 1.2, even though in theory \(E[\hat\beta_1]=\beta_1\). This gap is consistent with Monte Carlo noise from using only 100 samples; with more simulated samples, the mean of the estimates converges to the true value.
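As a rough check on the size of that gap, the mean of 100 independent slope estimates has its own standard error. Using the theoretical value \(\mathrm{SD}(\hat\beta_1)\approx 1.07\) reported above, \[\mathrm{SE}\big(\overline{\hat\beta_1}\big)=\frac{\mathrm{SD}(\hat\beta_1)}{\sqrt{100}}\approx\frac{1.07}{10}=0.107,\qquad \frac{1.2-1.04}{0.107}\approx 1.5.\] The simulated mean is therefore about 1.5 standard errors below the true slope, which is within the range ordinary sampling variation would produce.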
## [1] 0.96
Out of 100 simulated samples, 96 of the 95% confidence intervals contained the true slope \(\beta_1=1.2\). This empirical coverage rate of 96% is very close to the theoretical coverage of 95%.
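The repeated-sampling experiment can be sketched as below. This is a minimal illustration, not the code that produced the numbers above: the design (40 equally spaced points on \([0, 1]\)), the seed, and the helper name `sim_one` are all assumptions made for the example.

```r
set.seed(1)                        # seed chosen for illustration only
n     <- 40
n_sim <- 100
beta0 <- 1
beta1 <- 1.2
sigma <- 2
x <- seq(0, 1, length.out = n)     # ASSUMED fixed design, reused in every sample

# Hypothetical helper: simulate one sample, return the slope and its 95% CI.
sim_one <- function() {
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  ci  <- confint(fit, "x", level = 0.95)
  c(slope = unname(coef(fit)["x"]),
    lwr   = unname(ci[1, 1]),
    upr   = unname(ci[1, 2]))
}

res <- as.data.frame(t(replicate(n_sim, sim_one())))

sim_mean  <- mean(res$slope)                       # compare with beta1
sim_sd    <- sd(res$slope)                         # compare with theory_sd
theory_sd <- sigma / sqrt(sum((x - mean(x))^2))    # sigma / sqrt(S_xx)
coverage  <- mean(res$lwr <= beta1 & beta1 <= res$upr)

print(c(sim_mean = sim_mean, sim_sd = sim_sd,
        theory_sd = theory_sd, coverage = coverage))
```

With only 100 intervals, the empirical coverage itself has a standard error of about \(\sqrt{0.95 \times 0.05 / 100} \approx 0.022\), so observed rates a few percentage points away from 95% are unremarkable.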